MapReduce, a parallel programming and data execution architecture that Google and others have used to churn through massive amounts of Web data, is now ready to move into the enterprise world.
MapReduce is loosely related to 'map' and 'reduce' functions associated with so-called functional programming methods. Google's MapReduce has a counterpart in the Hadoop open-source Java MapReduce implementation. Now, two commercial MapReduce implementations have come forward from GreenPlum and Aster Data.
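To make the functional-programming connection concrete, here is a minimal single-process sketch in Python, assuming a simple word-count task. The names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, chosen for illustration only; they are not taken from the Google, Hadoop, GreenPlum, or Aster Data implementations, which distribute each phase across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values for one key into a single total."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(documents):
    # Map: each document is handled independently, so this step could run in parallel.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    # Shuffle, then reduce: each key is likewise independent of the others.
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["the web is big", "the data is big"])
```

Because neither phase shares state between documents or between keys, the work can in principle be split across as many processors as are available, which is the property the commercial products aim to exploit.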
"MapReduce sits on top of the global distributed file system Google has built, and it allows Google developers to write parallel applications that make use of the data," explained Scott Yara, president and co-founder of GreenPlum.
Curt Monash, analyst and strategic advisor to the software industry, was early to write on GreenPlum and Aster Data. He said that MapReduce could be useful in three categories of applications, which he dubs: text tokenization, indexing, and search; creation of other kinds of data structures (e.g., graphs); and data mining and machine learning.
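As an illustration of the first category, building an inverted index (a mapping from each term to the documents that contain it) follows the same map-then-reduce pattern. The sketch below runs in a single process and is only an assumption about how such a job might be structured; the function names and document IDs are hypothetical.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Tokenize one document and emit (term, doc_id) pairs."""
    return [(token.lower(), doc_id) for token in text.split()]


def build_index(docs):
    """Group document IDs by term, then reduce each group to a sorted list."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term, emitted_id in map_doc(doc_id, text):
            postings[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_index({"d1": "parallel data processing", "d2": "parallel search"})
```

A search for a term then becomes a lookup in `index`, returning the IDs of the documents that mention it.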
"MapReduce is most useful when you have lots of data to analyze," Monash told SearchSOA. "A lot of the use cases are at the biggest Web companies. But anybody with large analytic data processing tasks, anybody with data warehouses in the hundreds of terabytes should take a look at it."
Although it is useful in other data-intensive applications, MapReduce is ideally suited to data processing tasks that require massively parallel processing, he said.
"The absolute ideal use case is one where the job simply cannot get done unless a lot of processors are used in parallel," Monash said.
Yara at GreenPlum said the key thing MapReduce adds to his database product is the ability to analyze structured and unstructured data both inside and outside the database using parallel processing on commodity hardware.
"Customers have data that lives everywhere," Yara said. "It could live in Web services, it could live in files, it could live in databases. People want to be able to write programs that make use of that data whether it's unstructured or structured. With MapReduce, you can use the GreenPlum database as a parallel data processing platform without having to use the declarative model of relational SQL."
While Monash prefers to speak of MapReduce as a technology for Grid computing, Yara looks to a time when it will be the "programming language of choice for the Cloud."
"The big thing about Cloud computing is the ability to operate in parallel," Yara said. "By integrating MapReduce into the database you can do that well beyond the means of traditional SQL."
Yara sees MapReduce entering the corporate computing world as part of a concept he calls the "Enterprise Data Cloud." Enterprises themselves can start to build their own Cloud-based infrastructure and offer that up as a utility to their business units, he explained.