Choosing storage for streaming large files in big data sets
At the heart of key analytics advances today is big data, which may be viewed as a vast collection of structured and unstructured data, much of it culled from Web applications, server logs and social media sites. While big data applications are often associated with fast-moving organizations that can quickly act on real-time data feeds, big data and real time are not necessarily synonymous.
Industry experts point out that there is a real difference between big data at rest and big data in motion. Getting it moving typically requires additional middleware.
While they are modern, distributed and parallel, MapReduce and Hadoop -- two open source technologies closely linked with big data -- are batch-oriented. It may surprise some people, but they are usually used when big data is at rest; that is, unless they are accompanied by fairly advanced middleware. In-memory data grids or databases, complex event processing (CEP) engines and low-latency messaging middleware are just a few types of application infrastructure software that will be applied as architects take on the challenge of putting big data in motion.
"Fast data" is not a single technology, but a spectrum of approaches, according to Tony Baer, analyst at British research group Ovum. Fast data encompasses high-performance, low-latency CEP applications, where data streams are processed in memory to detect otherwise indecipherably sophisticated patterns, Baer wrote earlier this year in a blog post.
As users become more familiar with big data, the need to accompany such massive pools of information with more advanced types of messaging middleware will grow, according to Roy Schulte, an analyst at Gartner Inc. CEP is important to big data, Gartner holds, because it can process incoming data quickly by temporarily storing information in computer main memory.
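The core CEP idea Schulte describes, matching patterns against events held briefly in main memory rather than on disk, can be sketched in a few lines. The class and event values below are hypothetical illustrations, not any vendor's API: a sliding time window kept in memory, with the pattern check (here, three or more "error" events within the window) running entirely against that in-memory buffer.

```python
from collections import deque

class SlidingWindowDetector:
    """Minimal CEP-style sketch: keep recent events in main memory
    and flag a simple pattern (N or more matching events inside a
    fixed time window)."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def ingest(self, timestamp, value):
        # Evict events that have slid out of the time window.
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()
        self.events.append((timestamp, value))
        # The pattern check runs against the in-memory window only --
        # no disk round-trip, which is what keeps latency low.
        matches = sum(1 for _, v in self.events if v == "error")
        return matches >= self.threshold

detector = SlidingWindowDetector(window_seconds=10, threshold=3)
alerts = [detector.ingest(t, v) for t, v in
          [(1, "ok"), (2, "error"), (4, "error"), (9, "error"), (30, "error")]]
print(alerts)  # the third "error" within 10 seconds raises the only alert
```

Real CEP engines add richer pattern languages, parallelism and fault tolerance, but the latency advantage comes from the same design choice: the working set never leaves memory.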
Weighing system scaling
Big data presents a classic computer science I/O problem, one in which the voluminous "ingress" and "egress" become the key performance bottleneck. As is often the case, there may be a tendency to throw hardware at such a problem, not necessarily to good effect. The Hadoop framework is an example.
"People talk about the scale but not about the performance aspect of Hadoop," said Michael Kopp, technology strategist on the performance management team at Detroit-based Compuware Corp. "One aspect that strikes me is that people assume, because it is big data, that it is fast big data. If you look at Hadoop, you see it is batch-oriented. It's fast, but it will never be real-time."
And just because it is open source doesn't mean it will save companies money.
"People are struggling. Hadoop is not actually cheap and it's hard to manage, with many jobs running at different rates. Throwing more and more hardware at it makes the managing even harder," he said, indicating that some NoSQL and other systems in the big data market could come to look much like CEP systems -- ones that emphasize speed.
"CEP systems will have an important place in the whole discussion," he said. While he sees Hadoop and NoSQL development teams working hard to improve performance of queries and tuning the database, he said they too seldom optimize to effectively adapt to the way an application actually uses the data.
Enter high-performance messaging
Low-latency messaging is emerging as another middleware means for pushing big data faster. Although Wall Street financial applications are still the primary use case, high-performance messaging is positioned for broader use. Vendors offering such tools include IBM, Informatica, PrismTech, RTI, Red Hat, Software AG, Solace Systems, Tervela, Tibco and others.
Big data applications that tap into sensors or the so-called Internet of Things represent use cases that could take low-latency middleware beyond Wall Street apps. Such software has already been used in analytics applications covering aerospace, defense, power utilities and even parking systems, according to Angelo Corsaro, chief technology officer at PrismTech. Corsaro oversees work on OpenSplice DDS, which implements the Object Management Group's Data Distribution Service (DDS) standard for real-time systems.
"Applications use OpenSplice to distribute and cache very high volumes of often swiftly changing data," he told SearchSOA.com in an email. "The border between some technologies is becoming a bit more fuzzy."
"In a sense, OpenSplice provides some CEP capabilities," he said, pointing to its support of content-based subscriptions that can resemble continuous queries in the CEP domain.
"Regardless of peripheral overlap, technologies will continue to specialize and integrate," he added.
There are elements of CEP that distinguish its use from that of big data, of course. CEP tends to work with small data sets, said Merv Adrian, a Gartner analyst. Still, he sees a variety of technologies on their way that will speed up big data as we now know it.
"Big data to date has not been a real-time marketplace. New ways are emerging, but as they say, there is some assembly required," Adrian said. "For now, Hadoop is a tool set for after the fact. It's backwards looking, like business intelligence was."
And real-time capabilities are what people will expect from big data, Adrian said. "This will happen quickly. There is pressure," he said.
Big data efforts already represent whole new architectures compared with existing data schemes, so much is riding on project outcomes. People are not going to go to the trouble of adding new architectures, said Adrian, to "look at what they did last year." More change is in store.