The tantalizing prospect of cloud computing is changing the way people in IT think. Instead of building massive and ever-growing data centers, it may now be possible to simply tap into potentially unlimited resources residing externally, in the "cloud." Of course, these visions have already begun to take tangible form in cloud services such as Amazon EC2 and Microsoft Azure. However, according to analysts and others, the potential of the cloud, at least for data-intensive applications, will be limited without the application of a crucial enabling technology: distributed data grids.
A distributed data grid, also called a distributed data cache, operates between the database and an application's memory and provides a temporary repository for data, enhancing performance by speeding access and eliminating bottlenecks.
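To make the idea of a temporary repository in front of the database concrete, here is a minimal single-process sketch in Python. It assumes a simple "cache-aside" policy (check the cache, fall back to the database on a miss, invalidate on write), which is only one of several possible policies. The class names SlowDatabase and CacheAside are hypothetical and do not come from any product mentioned in this article.

```python
class SlowDatabase:
    """Hypothetical stand-in for a relational database; counts reads."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1  # each call here represents an expensive round trip
        return self._rows.get(key)

    def put(self, key, value):
        self._rows[key] = value


class CacheAside:
    """Serves repeat reads from memory and goes to the database on a miss."""

    def __init__(self, db):
        self._db = db
        self._cache = {}

    def get(self, key):
        if key in self._cache:
            return self._cache[key]       # hit: no database access
        value = self._db.get(key)         # miss: fetch from the database
        if value is not None:
            self._cache[key] = value
        return value

    def put(self, key, value):
        self._db.put(key, value)          # the database stays the system of record
        self._cache.pop(key, None)        # invalidate so the next read is fresh


db = SlowDatabase({"order:1": "3 widgets"})
cache = CacheAside(db)
first = cache.get("order:1")   # miss, reads the database
second = cache.get("order:1")  # hit, served from memory
```

In this sketch the second read never reaches the database, which is the bottleneck relief the technology promises. A real data grid spreads the cached entries across many servers rather than holding them in one process.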
Analyst Mike Gualtieri and his colleague at Forrester Research, John R. Rymer, have proposed an additional term -- elastic caching -- which captures a particularly useful characteristic of some data grids. Their recent report, The Forrester Wave: Elastic Caching Platforms, Q2 2010, describes the technology and some of the key vendors in the space.
Gualtieri says it is important to recognize that the concept of data caching covers a range of solutions. A distributed cache, he says, is best described as one that operates across one or more nodes. Gualtieri calls some types of distributed cache elastic because they can add and remove nodes while running. "And we think that is important and more descriptive of the defining characteristics of a data grid," he says.
By contrast, there are a number of potent but non-elastic distributed caching schemes, one of which is Memcached, an open source caching product that is widely used at Facebook and other Web properties. "Memcached is distributed but not elastic, you can decide that you have enough data so that you will require eight servers or 80 servers, but if it turns out you need more or fewer you have to shut down in order to add or remove them," he explains.
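One reason a statically partitioned cache is awkward to resize can be seen in a small Python sketch. It assumes, purely for illustration, that keys are placed on servers with a simple hash-modulo-server-count rule. That rule is an assumption made for this example, not a description of how any particular Memcached client behaves, and the function names are hypothetical.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash that is stable across runs (unlike Python's built-in hash())."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def server_for(key: str, num_servers: int) -> int:
    """Assumed placement rule: hash of the key modulo the server count."""
    return stable_hash(key) % num_servers


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Share of keys whose server changes when the server count changes."""
    moved = sum(
        1 for k in keys if server_for(k, old_count) != server_for(k, new_count)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, 8, 9)  # grow the farm from 8 servers to 9
print(f"{moved:.0%} of keys would land on a different server")
```

Under this rule a key stays put only when its hash leaves the same remainder for both server counts, which in expectation is about one key in nine when going from eight servers to nine. Most of the cache would effectively have to be reloaded.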
Many people associate cloud computing with scale, notes Gualtieri. Certainly the cloud allows you to scale instances of machines -- but you can't easily scale applications and data in the cloud because applications and data haven't been architected to take advantage of the "extra horsepower."
Likewise, if you think of a typical relational database packed with customer or order information, when it comes to the cloud, that database becomes your bottleneck. "If you are getting more and more transactions against that database you can try to speed it up by adding five more servers, but how do you split the data? You can't," says Gualtieri. "So elastic caching is really interesting because it has a huge impact for the cloud -- it is a solution for scaling data," he adds.
Because elastic caching platforms can add nodes in real time, if you start with four servers and add four more, they will rebalance the data fairly evenly across all the nodes. And if any node goes down, the application stays up because the platforms replicate the data. "So elastic caching also provides fault tolerance and high availability at a fraction of the cost of what it would take just to re-architect a database," he adds.
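The rebalancing and replication behavior described here can be sketched in Python with a consistent-hash ring. This is an illustrative assumption: the article does not say which placement algorithm any vendor uses, and the ElasticRing class, its parameters and the node names are all hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class ElasticRing:
    """Consistent-hash ring that keeps each key on `copies` distinct nodes."""

    def __init__(self, nodes, vnodes=200, copies=2):
        self.vnodes = vnodes      # virtual points per node, to smooth the load
        self.copies = copies      # primary plus backups
        self._ring = []           # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def owners(self, key):
        """Walk clockwise from the key and collect the first distinct nodes."""
        start = bisect.bisect(self._ring, (_hash(key), ""))
        found = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == self.copies:
                    break
        return found


keys = [f"order:{i}" for i in range(10_000)]
ring = ElasticRing([f"node{i}" for i in range(4)])   # start with four servers
before = {k: ring.owners(k) for k in keys}

new_nodes = [f"node{i}" for i in range(4, 8)]        # add four more while running
for node in new_nodes:
    ring.add_node(node)
after = {k: ring.owners(k) for k in keys}

load = Counter(after[k][0] for k in keys)            # primary copies per node
moved = sum(1 for k in keys if before[k][0] != after[k][0]) / len(keys)

ring.remove_node("node3")                            # simulate a node failure
survivors = {k: ring.owners(k) for k in keys}
```

In this sketch, doubling the cluster moves roughly half of the primary copies, and only onto the new nodes. When node3 disappears, every key it held still has a copy on a node that remains one of the key's owners, so no data has to be recovered from the failed server.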
According to Gualtieri, the quest to deliver cloud scalability has also spawned a few other variations, notably the NoSQL movement. "On first glance it sounds like an attempt to get rid of SQL but the term actually stands for Not Only SQL," he says. Of course, he notes, traditional relational databases are great at transactional integrity; they always provide consistent data.
By contrast, notes Gualtieri, the NoSQL crowd talks about a concept called eventual consistency. For example, when someone does an update on Twitter or Facebook, it isn't absolutely necessary that every user on the internet sees it that second, as long as it arrives eventually. "It isn't like decrementing $100, you need a relational database for that," says Gualtieri.
For all the data that doesn't need absolute timeliness or strict consistency, NoSQL offers eventual consistency instead. "You give up some of the transactional integrity but what you get is an inexpensive way to scale a large amount of non-transactional data," he says.
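A toy Python sketch can show what eventual consistency means in practice. It assumes two replicas that reconcile with a simple last-write-wins rule based on a version number. That rule is only one of several reconciliation strategies and is not attributed here to any specific NoSQL product. The Replica class and its methods are hypothetical.

```python
class Replica:
    """One copy of the data; accepts writes locally and syncs later."""

    def __init__(self, name):
        self.name = name
        self._data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self._data.get(key)
        # Last write wins: keep the entry with the highest version number.
        if current is None or version > current[0]:
            self._data[key] = (version, value)

    def read(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Background exchange: pull every entry the other replica holds."""
        for key, (version, value) in other._data.items():
            self.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("status:alice", "at the beach", version=1)

stale = b.read("status:alice")   # replica b has not heard about the update yet
b.sync_from(a)
fresh = b.read("status:alice")   # after syncing, b returns the update
```

Between the write and the sync, a reader on replica b simply does not see the status update. That is acceptable for a social feed but not for an account balance, which is the distinction Gualtieri draws.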
Historically, notes Gualtieri, NoSQL grew out of attempts by Amazon and eBay to master issues of scale. "What has happened over the years is that these technologies and similar ones have made their way into open source projects," he explains. One of them, Cassandra, is an open source NoSQL database "that is very much like elastic caching in that the data is distributed, spread across multiple nodes, and it is fault tolerant." However, he adds, in general, NoSQL is not as well defined or developed as elastic caching, and most NoSQL products are open source.
Elastic caching products
Coming back to the elastic caching vendors, Gualtieri's report pegs IBM (WebSphere eXtreme Scale), Terracotta (Ehcache FX edition), GigaSpaces (XAP), and Oracle (Coherence) as leaders in the sector.
Cameron Purdy, vice president of development at Oracle, says, "The goal with Coherence is to dramatically simplify the usage of the software and the learning curve for building, deploying and operating data grids, regardless of their size."
Purdy reiterates the point that if a cloud is to achieve true capacity-on-demand, it needs to be able to shift its server footprint by tens, hundreds or even thousands of servers. "Any time an application is running on more than one server, it has state that it has to manage across those servers, and that is what a data grid enables, and specifically that is what Coherence enables," he says.
Purdy says Oracle Coherence is not a relational database management system (RDBMS). Instead, Coherence manages application data in the form that the application works with that data, such as "objects" in languages such as Java, C# and C++. Oracle Coherence manages live application data (application state, sessions, caches, etc.) in memory, using multiple servers to provide both scalability and availability in managing that application data. Furthermore, Coherence does so automatically as the server footprint grows and shrinks, all without loss of data or interruption of service.
The Forrester report also named a second tier of "strong performers," namely GemStone Systems (GemFire), Alachisoft (NCache), and ScaleOut Software (StateServer).
William Bain, founder and CEO of ScaleOut Software and a veteran of the parallel computing industry, says, "We combine distributed caching with parallel data analysis." Bain says elasticity is central to the idea of distributed caching. "What distributed caching offers is the ability to have lots of threads on lots of servers accessing a common pool of data -- and as applications grow in scope they have the ability to have scalable storage," he says.
Like Gualtieri, Bain says Memcached "runs out of steam" when the data is being updated rapidly and being read, as occurs in e-commerce shopping carts. "In a distributed cache it can be read and updated by many web servers and web farms -- and it can scale," he says.
Advice on data caching
Bain says CTOs and their architects should make an effort to get up to speed on distributed caching and data grids and understand the power they provide in scaling application state. "By becoming savvy on this technology they will be able to move applications to the cloud and scale seamlessly across a large pool of virtual servers," he says. On the other hand, Bain predicts that those who don't take advantage of distributed caching will find it hard to eliminate all the bottlenecks to scalability and storage in multiserver environments.
Expanding on that concept, Purdy says in order for an application to take full advantage of a data grid, the application should have a strong domain model. In other words, the information that the application consumes, creates and uses internally to run should be expressed as -- and managed as -- a set of defined entities with well-understood relationships among those entities.
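A small Python sketch may make the idea of a strong domain model more concrete. It assumes a single Customer entity and a narrow storage interface, so that application logic never touches the backing store directly. Every name in it (Customer, CustomerStore, InMemoryGridStore, rename_customer) is hypothetical, and the in-memory class merely stands in for a data grid client; it is not the API of Coherence or any other product.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Customer:
    """A domain entity: a defined type with well-understood fields."""
    customer_id: str
    name: str


class CustomerStore(Protocol):
    """The only way application code is allowed to reach stored customers."""

    def load(self, customer_id: str) -> Optional[Customer]: ...

    def save(self, customer: Customer) -> None: ...


class InMemoryGridStore:
    """Hypothetical stand-in for a data grid; could be swapped for a database."""

    def __init__(self) -> None:
        self._entries: Dict[str, Customer] = {}

    def load(self, customer_id: str) -> Optional[Customer]:
        return self._entries.get(customer_id)

    def save(self, customer: Customer) -> None:
        self._entries[customer.customer_id] = customer


def rename_customer(store: CustomerStore, customer_id: str, new_name: str) -> Customer:
    """Application logic written against the interface, not the storage."""
    customer = store.load(customer_id)
    if customer is None:
        raise KeyError(customer_id)
    updated = Customer(customer.customer_id, new_name)
    store.save(updated)
    return updated


store = InMemoryGridStore()
store.save(Customer("c1", "Ada"))
renamed = rename_customer(store, "c1", "Grace")
```

Because rename_customer depends only on the CustomerStore interface, the class behind it could be replaced with one that talks to a data grid without changing the application logic.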
"In modern object oriented languages such as Java, these entities are typically implemented as Java classes, and are often referred to simply as domain objects," he notes. According to Purdy, applications that have a strong domain model tend to be very easy to move to a data grid, while applications that do not encapsulate data access and representation -- such as those that sprinkle direct SQL database access throughout the application -- are difficult to adapt to data grids.
Summarizing, Gualtieri says his general advice is that if you are considering cloud computing, you must ask whether the application architecture is elastic and then ask three follow-up questions. The first: how can you scale your database? The second: how can your data, and your application code, shrink and grow to take advantage of the cloud? The third concerns performance: what is your performance strategy in the cloud?
"The reason I mention performance is because people hear about the opportunity to have more instances in the cloud, but more instances doesn't necessarily translate into more performance because you won't necessarily know what platforms you are running on," he says. In fact, he advises planning to do load and performance testing in the cloud. "You can't assume that just because it is in the cloud it will perform better -- start with the data first, because that is going to be a bottleneck," he adds.