There was a time when the only people with the drive and resources to push the limits of data analysis were scientists and Wall Street financial wizards. But ever since the Apache Hadoop software library hit the scene, the costs involved in working with big data have decreased and the use cases have increased.
According to analyst firm McKinsey Research, the growth of big data means worldwide data generation is increasing at about 40% per year, while IT spending is expected to rise by only 5% to meet that demand. Once again, the enterprise architect is being asked to do more with less. Still, there are many tools for meeting businesses' big data needs.
Apache Hadoop is an open source big data analytics engine based on the MapReduce model made famous by software engineers at Google. This Java-based software framework is an important tool in the fight to corral and analyze huge barrages of information from both structured and unstructured sources. Although it is restricted to batch processing and does not focus on real-time business analytics, Hadoop provides methods for drawing pertinent information out of enormous volumes of data and can be an instrumental piece of a proper business intelligence system.
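For readers unfamiliar with the MapReduce model, the sketch below shows its three steps (map, shuffle, reduce) applied to a word count. It is plain Python running in a single process, not Hadoop's Java API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical names chosen for illustration. A real Hadoop job would distribute the same steps across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a batch of lines."""
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    batch = ["big data is big", "data is everywhere"]
    print(word_count(batch))
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be spread over many workers, which is the property that lets the model scale to very large batches.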
In-memory data grids are poised to play an important role in big data analytics in coming years. While this technology was once rough-cut, the field has grown and commercial offerings have improved, adding features that cater to users' needs. In-memory data grids target enterprises that have fast-changing big data and are therefore unable to effectively implement a batch-processing model such as Hadoop's MapReduce model. Enhancements in key APIs are simplifying the challenges involved in moving from batch data processing toward a data stream approach.
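The difference between the batch and data stream approaches can be sketched in a few lines. The example below is an illustration only and does not use any data grid product's API; `RunningAverage` and `batch_average` are hypothetical names. The streaming version updates an in-memory result as each event arrives, while the batch version rescans everything collected so far.

```python
class RunningAverage:
    """Keeps a per-key mean in memory and updates it as each event arrives."""

    def __init__(self):
        self._count = {}
        self._total = {}

    def update(self, key, value):
        """Fold one new event into the stored state and return the new mean."""
        self._count[key] = self._count.get(key, 0) + 1
        self._total[key] = self._total.get(key, 0.0) + value
        return self._total[key] / self._count[key]


def batch_average(events):
    """Batch equivalent: scan every stored event to compute each key's mean."""
    totals, counts = {}, {}
    for key, value in events:
        totals[key] = totals.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


if __name__ == "__main__":
    events = [("sensor-a", 10.0), ("sensor-b", 4.0), ("sensor-a", 20.0)]
    stream = RunningAverage()
    for key, value in events:
        # The current mean is available immediately after every event.
        print(key, stream.update(key, value))
    print(batch_average(events))
```

Both versions arrive at the same means; the streaming version simply has an answer ready after every event instead of waiting for the next batch run.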
As with many open source projects that different organizations use in slightly different ways, Hadoop comes in several flavors that enterprises can choose from. Independent software vendors that wish to integrate Hadoop may find it challenging to cover all of these bases. One application integration specialist is providing a driver that is meant to connect applications that currently support Open Database Connectivity (ODBC) with the power of Hadoop.
Leading a big data analytics project requires a host of skills, not least of which is knowledge of Apache Hadoop and the MapReduce model -- but other skills are needed, too. According to IT skills specialist Foote Partners LLC, the list also includes HBase, Pig, Hive, Cassandra, MongoDB, CouchDB, XML, Membase, Java, .NET, Ruby, C++ and more. An ideal candidate should also be comfortable with analytics, high-speed computing, statistics and sophisticated algorithms.
Some experts see cloud infrastructure as an essential component of successful big data analytics. While they recognize the importance of Hadoop integration and the MapReduce approach to dividing and conquering terabytes of data in a matter of minutes, these experts claim that traditional server infrastructures are prohibitively expensive to scale up to that point. The intersection of big data with the cloud is where things get really interesting.
This was first published in September 2012