In recent years, software architects have begun to employ distributed in-memory data grids to turbocharge data access. The early versions of these grids were somewhat raw, but their capabilities have grown, and new traits have been added in response to user needs. They are poised now to play an important role in some significant shifts in software architecture.
Many of the new traits of in-memory data grids (IMDGs) target apps that fall under the ambiguous umbrella called "big data." According to industry analyst firm Gartner Inc., IMDGs are suited to handle big data's big-three Vs. First, they support the velocity needs of big data. That is, IMDGs support hundreds of thousands of in-memory data updates per second. Second, like NoSQL data stores, they can support big data variety. Finally, they can be clustered and scaled in ways that support large volumes of data.
In-memory data technology is no longer obscure. In the form of an in-memory database (IMDB), it is a pivotal part of industry powerhouse SAP's effort to bring advanced technology to bear on customers' issues. The company's HANA IMDB is important to its recent efforts. IBM, Microsoft, Oracle and others also have fielded IMDB technology.
Recent releases of commercial data grid technology bear out the interest in supporting big data apps. Take these releases as examples:
- ScaleOut Software Inc.'s ScaleOut StateServer Version 5 adds global data integration that combines data grids at multiple sites into a single, globally accessible, "virtual" data grid -- not a bad trait in the era of cloud computing. It also supports enhanced parallel query, which enables fast queries of grid data based on object properties.
- GigaSpaces Technologies Inc.'s XAP 9.0 has hooks for handling massive data sets. Enhancements include support for sorting events according to properties. Object properties can be handled in native, binary and compressed modes. In addition, the software supports GigaSpaces' Cloudify open source Platform-as-a-Service stack for managing Hadoop in cloud computing environments.
- Terracotta Inc.'s BigMemory. With Version 3.7, the capacity of BigMemory has increased by a factor of ten, and customers can add servers linearly as needed. The performance of its in-memory search has improved, and there are secure SSL communications on all endpoints. BigMemory works with Terracotta's Ehcache.
Applications that are likely to benefit from IMDG advancements include financial-instrument pricing apps in banking, shopping carts in e-commerce, user-preference calculations in Web applications, and reservation systems in the travel industry. In addition, the grids underlie some advanced cloud applications.
In-memory data grids benefit fast-changing big data
Big data has different faces. Some big data shifts slowly; some changes fast. "In general, what we try to do is help people that have fast-changing data," said William Bain, founder and CEO of ScaleOut Software.
Moving all that big data around can be a problem for some companies. In-memory data grids can help minimize data movement. You have to look at "how you minimize the motion," Bain said. "Data motion can kill performance when doing analytics." Better scalability is another benefit of IMDGs. "People today are facing issues of scale, especially when they mix Web front ends with enterprise software architecture. All these things lead to distributed caching and in-memory data grids," he added. "The trick is to limit the explosion of [application programming interfaces] APIs. That means, when you are trying to do something new, you try to see if it fits into existing APIs in a seamless way," he said.
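To make the point about data motion concrete, here is a minimal Python sketch. It is an illustration only and does not use any vendor's API: the `partitions` dictionary stands in for grid nodes, and `local_summary` and `grid_average` are hypothetical names. Each partition computes a small summary where its data lives, so only two numbers per node would cross the network instead of every stored value.

```python
from typing import Dict, List, Tuple

# Assumption: three in-process lists stand in for data held on three grid nodes.
partitions: Dict[str, List[float]] = {
    "node-a": [float(i) for i in range(0, 1000)],
    "node-b": [float(i) for i in range(1000, 2000)],
    "node-c": [float(i) for i in range(2000, 3000)],
}


def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs next to the data and returns only (sum, count)."""
    return sum(values), len(values)


def grid_average(parts: Dict[str, List[float]]) -> float:
    """Combines the per-node summaries into one global average."""
    summaries = [local_summary(values) for values in parts.values()]
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


average = grid_average(partitions)

# Values that would travel if every node shipped its raw data to one place.
values_moved_naive = sum(len(values) for values in partitions.values())
# Values that travel when each node ships only its (sum, count) pair.
values_moved_local = 2 * len(partitions)
```

In this toy setup, 3,000 values stay where they are and six numbers travel, while the computed average is the same as it would be after centralizing the data.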
Perhaps this will refresh your in-memory data cache
Advances in 64-bit and multi-core systems have made it fairly easy to store tens of gigabytes and even terabytes of data completely in memory, said Nati Shalom, chief technology officer at GigaSpaces. This has helped data caches and in-memory data grids move ahead. The early APIs for the in-memory caches could be described as "raw," but Shalom uses the term "simple."
"During the early stages of the technology evolution, memory-based caches exposed a simple key-value API and enabled fairly simple query with no transaction or advanced query semantics," Shalom told SearchSOA.com. "Programming to this interface was fairly simple and intuitive; however, it fit the simple use case of read-mostly side-cache," he said. There was, however, complexity in mapping complex queries, and synchronizing with an external database.
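The read-mostly side-cache Shalom describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any product's API: `SideCache` is a hypothetical class, and a plain dictionary named `database` stands in for the external database.

```python
from typing import Callable, Dict, Optional


class SideCache:
    """Toy key-value side-cache that sits beside a slower data store."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._store: Dict[str, str] = {}
        self._load_from_db = load_from_db  # assumption: a database lookup by key
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        """Serve from memory when possible; otherwise load and remember."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._load_from_db(key)
        if value is not None:
            self._store[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached entry so the next read goes back to the database."""
        self._store.pop(key, None)


database = {"user:1": "alice", "user:2": "bob"}  # stand-in for the real database
cache = SideCache(database.get)

first = cache.get("user:1")   # miss: loaded from the database
second = cache.get("user:1")  # hit: served from memory

database["user:1"] = "carol"  # the database changes behind the cache's back
stale = cache.get("user:1")   # the cache still returns the old value
cache.invalidate("user:1")
fresh = cache.get("user:1")   # reloaded after explicit invalidation
```

The last four lines show the synchronization problem in miniature: the simple interface has no way to learn that the database changed, so the application has to invalidate entries itself.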
"As memory-based caches evolved into in-memory data grids, this complexity challenge was addressed by introducing better computability [via] the data grid API with standard SQL APIs such as [Java Persistence API] JPA, [Java Database Connectivity] JDBC and SQL, as well as new APIs designed for the new generation of Web and social applications, [to] expose 'schemaless APIs' and object graph APIs," Shalom said. Serious enhancements continue for the purposes of big data apps, as shown in the fact that GigaSpaces' most recent release allows object properties to be handled in native, binary and compressed modes.
After the data deluge, new systems use IMDG
The way that websites generate data -- logs, user sessions, social media messages and so on -- is amazing. This data generation and the general trend toward digitization are driving the big data push. Big data processing might well be what people "do" with the cloud computing architectures now being built out, making big data the more important cog.
"Big data is not like cloud -- it's a real problem," chides Gary Nakamura, general manager of Terracotta, which is a wholly owned subsidiary of Software AG. "People didn't run out to do cloud."
"Big data wouldn't exist if the [present] database or data warehouse could handle it," Nakamura said, noting that conventional databases cannot deliver performance and scalability simultaneously for the kinds of data described as "big." He said Terracotta's BigMemory 3.7 allows people to put more data in memory while using the same amount of space (due to compression).
"BigMemory is a general purpose data management solution. Our focus is around high and sustained performance, more and more scale and greater ease of operation," said Nakamura, in an e-mail message.
All in all, the big data deluge drives a new way of designing systems, one that is well suited to in-memory data grid technology, GigaSpaces' Shalom said. It is much more oriented to event processing and streams of data than it is to batches or jobs.
"The reality [big data] is a new way of thinking," Shalom said. The assumption of established batch processing, he emphasized, is that data comes in bursts. "This is an assumption that is going to break. In the past, you got a window to do processing. You recorded a lot of data fast. Now we are moving to a case where that assumption doesn't work," he said. "Because we are getting streams of data rather than bursts of data, we need to process globally, and quickly. You don't have the window you used to. A lot of systems have to adopt and change. For a lot of companies, it's not even an option."
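The shift from batch windows to streams can be illustrated with a small Python sketch. It is a generic example rather than anything taken from GigaSpaces' software, and the names `RunningMean`, `batch_mean` and `events` are hypothetical. The batch function needs the whole data set before it can answer, while the streaming version updates its answer as each event arrives.

```python
from typing import List


def batch_mean(window: List[float]) -> float:
    """Batch style: wait for the full window, then compute once."""
    return sum(window) / len(window)


class RunningMean:
    """Stream style: keep a small state and update it per event."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Incremental mean: shift the old mean toward the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


events = [4.0, 8.0, 6.0]  # assumption: three readings arriving one at a time

stream = RunningMean()
means_so_far = [stream.update(e) for e in events]  # an answer after every event

final_batch = batch_mean(events)  # only available once all events are collected
```

Both approaches agree once all the data has arrived, but only the streaming version has a usable result at every step in between.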
Correction: An earlier version of this story incorrectly described SAP HANA as an IMDG. It is an IMDB.