An architect's guide: How to use big data
Data has value and companies can't live without it, but just how much is data worth? How much does it cost to analyze big data and derive value from it?
In the past, technologists provided upper management with historical data so they could spot market trends. Statistics -- while useful for getting high-level views of market trends and of how an organization is performing in the market -- are not sufficient for determining what new products and services to create. These statistics don't reveal what customers actually want.
Analysts, researchers and business users analyze big data to make better and faster decisions. Using advanced analytics techniques, such as text analytics, machine learning, predictive analysis, data mining and statistics, businesses can analyze previously untapped data.
Companies generate large amounts of data and have the capability to collect information from other sources, such as mobile applications, sensors, websites, clickstream data and social media activity. This data can be turned into a product.
Collecting and analyzing large amounts of data, primarily unstructured data, is not an easy task. Current company systems are not equipped to process 500 terabytes of data per week to glean the nuggets that can help companies create new products and services that customers want. This has led companies to look at high-performance computing (HPC) resources, which are typically used to solve problems such as weather and climate forecasting, parametric modeling and stochastic modeling, as a way to process huge amounts of commercially oriented data.
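To put the 500 terabytes per week figure in perspective, the short calculation below converts it into a sustained ingest rate. It is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes) and data arriving evenly across a seven-day week, neither of which the article specifies, and the function name is made up for illustration.

```python
# Assumptions (not from the article): 1 TB = 10**12 bytes, and the data
# arrives at a steady rate over a full 7-day week.
BYTES_PER_TB = 10**12
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def sustained_rate_bytes_per_sec(terabytes_per_week: float) -> float:
    """Average bytes per second needed to keep up with a weekly volume."""
    return terabytes_per_week * BYTES_PER_TB / SECONDS_PER_WEEK


rate = sustained_rate_bytes_per_sec(500)
print(f"{rate / 10**6:.0f} MB/s sustained, or {rate * 8 / 10**9:.2f} Gbit/s")
```

Under these assumptions the rate works out to roughly 827 MB/s, around the clock, before any analysis is done on the data. Bursty arrival would push the peak requirement higher.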
Big data analytics is the use of analytic techniques against very large, diverse data sets that span structured and unstructured data, streaming and batch data, and sizes that range from terabytes to petabytes to zettabytes. It examines different data types to uncover hidden patterns, unknown correlations and other useful information.
This information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. High-performance data analysis (HPDA) is the term adopted to describe the convergence of the data-intensive HPC market and the high-end commercial data analytics market.
HPC, big data and traditional IT collide
HPC offerings are generally run on large and costly supercomputers with hundreds or even thousands of servers. HPC software and hardware may be specially architected to solve a narrow class of problems unsuitable for big data and traditional IT. The Top500 list of supercomputers offers insight into the HPC applications running on large systems classified as supercomputers.
Big data requires supercomputing-like capabilities provided by HPC combined with scheduling and optimization software that can manage numerous jobs over multiple environments simultaneously. This enables enterprises to leverage HPC-like computing while optimizing an existing diverse infrastructure. The problem with trying to marry HPC and HPDA technologies is that the folks working in each area are not very familiar with the other's technology.
Technical issues around HPC and HPDA
A concern with trying to use special-purpose HPC architectures for HPDA is the need to adapt existing software or develop new software, which can consume time and resources. Since big data analytics may not fit the traditional data warehouse or business intelligence data model, traditional data warehouses may not be able to handle the processing demands.
As a result, big data technology has emerged and is being used in many big data analytics environments. The technologies include NoSQL databases, Hadoop and MapReduce. These technologies form the core of open source software that supports the processing of large data sets across clustered systems.
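The MapReduce model mentioned above can be illustrated with a toy word count. The sketch below runs in a single Python process and does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only to mirror the three stages of the model. In a real cluster the map and reduce calls would be spread across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input record into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data big value", "data moves slowly"])
print(counts)
```

Because each map call sees only its own record and each reduce call sees only one key, the work can be partitioned across clustered systems, which is the property that makes the model attractive for large data sets.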
Budget constraints are limiting access to necessary compute resources at a time when explosive growth in data makes access increasingly desirable. Some hardware processor vendors, such as Intel with its Xeon Phi coprocessor, are working to provide a breakthrough in heterogeneous computing. The Xeon Phi coprocessor is intended to deliver good throughput and energy efficiency without high costs, inflexibility and programming challenges.
Some believe that the way companies must tackle the big data problem is with a big data workflow. The big data workflow approach utilizes all available resources within the data center. In big data workflow environments, specialized applications analyze, dissect and refine data into reports and new data sets. Analysts and executives make decisions accordingly and adjust their data input requests, and the entire process begins again.
A big data workflow is constructed of multiple applications and workloads that may interact with large input data sets and generate other data sets as an output. The net effect is a complex web of data access and processing, representing a different degree and range of access that traditional storage systems are not built to handle.
Traditional storage systems have innovated around either delivering large capacity for archiving or delivering high performance for enterprise storage. Big data analytics mixes several access patterns: high-throughput streaming reads of large fragments of data, streaming writes as data is created, and random I/O for further analysis. Traditional storage systems can't scale to support the required capacity and simultaneous access.
Emerging big data problems are exposing the limitations of current HPC-architected computers. Most HPC platforms are compute-centric and lack the strong storage and I/O (data movement) capabilities that are important for big data processing. The problem is that HPC systems may spend a small number of compute cycles to compute a result and then spend hundreds of cycles moving the result through the system. Big data needs to continuously process large and growing volumes of information, requiring fast and frequent data movement between application servers, across network connections and across storage.
The HPC community is planning to address HPDA challenges, such as data movement, by reducing data movement at all levels via in-memory processing or accelerating data movement via more capable fabrics and interconnect networks. This will lead to improved core-to-core communications.
HPDA and clouds
No single HPC architecture is best to manage or analyze big data analytics workloads. Heterogeneous computing is necessary to proceed productively. Moving heterogeneous HPC resources to the cloud is one way some organizations may be able to afford access to the latest compute power. Commercial cloud vendors, such as Amazon, are adding various HPC elements.
Public clouds are useful primarily for embarrassingly parallel HPC jobs and are much less effective on jobs requiring major interprocessor communications via MPI or other protocols. Therefore, highly parallel HPDA problems can be attractive candidates for public clouds. HPDA cloud usage is expanding to include even less partitionable problems such as graph analytics, as long as the problems do not have to be solved in real time.
Commercial companies, including small companies and start-ups, are at the forefront of a trend toward taking HPDA problems directly to public clouds and avoiding the capital expense needed to build on-premises data centers. However, there can be large costs involved in moving and protecting large amounts of data transported to a public cloud.
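An embarrassingly parallel job is one whose tasks never need to talk to each other while they run. The sketch below is a minimal illustration of that shape, not a cloud deployment: it uses a local thread pool from Python's standard library as a stand-in for separate cloud instances, and the data and function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def summarize_partition(values: List[float]) -> Dict[str, float]:
    """Process one partition in isolation; nothing is shared with other tasks."""
    return {"count": len(values), "total": sum(values), "peak": max(values)}


def run_independent_tasks(partitions: List[List[float]],
                          max_workers: int = 4) -> List[Dict[str, float]]:
    """Run every partition as its own task.

    A local thread pool stands in for separate machines here (an assumption
    made for illustration); the tasks exchange no messages while running.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize_partition, partitions))


def combine(summaries: List[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-partition results in one cheap final step."""
    return {
        "count": sum(s["count"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
        "peak": max(s["peak"] for s in summaries),
    }


partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
result = combine(run_independent_tasks(partitions))
print(result)
```

The only coordination is the small merge at the end. A job that required partitions to exchange intermediate values at every step, as many MPI codes do, would not fit this pattern and would be more sensitive to the network between instances.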
Advice for analysis success
The introduction of HPDA into an IT organization requires a significant amount of planning and the introduction of HPC, HPDA and data storage experts. Experts from these areas need to be recruited and be willing to work together. This will require a new organization within an organization's IT environment.
The lack of a flexible HPDA solution within IT can constrain big data strategies. Big data has the potential to consume large amounts of data center space and power. If adequate space to expand isn't available, consider purchasing systems built with processors -- such as the Xeon Phi processor -- that are dense and power-efficient.
Public clouds should be considered to develop big data solutions so hardware doesn't need to be purchased for testing. When these solutions are placed into production, practical issues such as the cost and time required to move large quantities of data to the cloud may drive an organization to an on-premises solution for production.
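The time required to move data to the cloud can be estimated with simple arithmetic. The sketch below is an illustrative calculation only: the data volume, link speed and the 80% link efficiency are assumed values, not figures from this article, and `transfer_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(data_terabytes: float,
                  link_gbit_per_sec: float,
                  efficiency: float = 0.8) -> float:
    """Estimate days needed to push a data set over a network link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gbit = 10**9 bits) and
    that only `efficiency` of the nominal link speed is usable in practice.
    """
    total_bits = data_terabytes * 10**12 * 8
    usable_bits_per_sec = link_gbit_per_sec * 10**9 * efficiency
    return total_bits / usable_bits_per_sec / SECONDS_PER_DAY


# Hypothetical case: 1 PB (1,000 TB) over a dedicated 10 Gbit/s link.
days = transfer_days(1000, 10)
print(f"about {days:.1f} days")
```

Under these assumptions a single petabyte takes more than 11 days to upload, which helps explain why a solution developed in a public cloud may still end up on-premises for production.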
High-end data storage products suitable for traditional enterprise applications may not be best for HPDA, given its specialized requirements. The petabyte-scale storage volumes potentially required for HPDA exceed the capabilities of traditional data protection solutions. Creating data protection copies through backup or replication may not be feasible for large amounts of data within limited timeframes.
Growing HPDA infrastructures can rapidly throw a storage infrastructure out of balance and drive up operating costs if the storage infrastructure scales ineffectively. Several storage companies, including EMC, NetApp and IBM, claim to offer storage solutions that address these requirements.
HPDA requires a lot of money to get started and keep going. Organizations have to weigh the cost of creating an HPDA infrastructure against the value gained from analyzing big data. The problem is that there may not be a choice, especially if competitors are spending the money for HPDA and getting results that help them develop products and services customers want.
About the author:
Bill Claybrook is a marketing research analyst with over 35 years of experience in the computer industry with the last dozen years in Linux, open source and cloud computing. Bill was research director, Linux and open source, at The Aberdeen Group in Boston and a competitive analyst/Linux product-marketing manager at Novell. He is currently president of New River Marketing Research and Directions on Red Hat. He holds a Ph.D. in Computer Science.