MapReduce in the browser helps collaborative supercomputing

Recently a number of different developers have proposed techniques for running MapReduce directly in the browser to significantly lower the barrier for swarm computing.

Collaborative MapReduce in the browser lowers the barriers for collaborative supercomputing.

The idea of enrolling a large audience of people to donate computer time to worthwhile computing projects achieved widespread fame when UC Berkeley rolled out the SETI@home project in 1999 to search for extraterrestrial intelligence.

Since the initial release of SETI@home, millions of consumers have donated spare CPU time on their computers, and in the process have created what has been claimed to be the largest computer on the planet. A more general-purpose version, the Berkeley Open Infrastructure for Network Computing (BOINC), supports virtually any kind of computation, including searching for a cure for cancer, creating better climate models, looking for gravitational waves, and researching clean energy. Almost 4 million computers have participated in BOINC, delivering over 1.5 PetaFLOPS of performance, compared with 500 TeraFLOPS for Blue Gene, the largest supercomputer in the world.

Meanwhile, MapReduce has been getting a lot of attention since Google first publicly announced it in 2004. As noted in an earlier article on SearchSOA, MapReduce is useful in three categories of applications: text tokenization, indexing, and search; creation of other kinds of data structures (e.g., graphs); and data mining and machine learning. Workloads that require massively parallel data processing are ideally suited to MapReduce.

MapReduce provides a framework for mapping a computation to run across thousands of low-cost PCs, and then reducing, or reassembling, the individual computations into a final answer. Although Google has not publicly disclosed its own implementation, the model has gained widespread attention in the development community through a variety of open source implementations, including Hadoop, GridGain, Skynet, and Disco. At the same time, both Greenplum and Aster Data Systems have released commercial versions.
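To make the map and reduce steps concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It is an illustration only, not the code of any of the implementations named above; the names `mapChunk`, `shuffle`, `reduceGroup` and `wordCount` are hypothetical, and a real system would run the map calls on many machines rather than in one loop.

```typescript
// A key/value pair emitted by the map phase.
type Pair = [string, number];

// Map phase: turn one chunk of text into (word, 1) pairs.
function mapChunk(chunk: string): Pair[] {
  const words = chunk
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return words.map((w): Pair => [w, 1]);
}

// Shuffle phase: group the emitted values by key.
function shuffle(pairs: Pair[]): Record<string, number[]> {
  // A prototype-less object avoids clashes with keys such as "constructor".
  const groups: Record<string, number[]> = Object.create(null);
  for (const [key, value] of pairs) {
    const existing = groups[key];
    if (existing) {
      existing.push(value);
    } else {
      groups[key] = [value];
    }
  }
  return groups;
}

// Reduce phase: collapse all values for one key into a single total.
function reduceGroup(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0);
}

// Driver: in a distributed setting each chunk would go to a different worker.
function wordCount(chunks: string[]): Record<string, number> {
  const emitted: Pair[] = [];
  for (const chunk of chunks) {
    for (const pair of mapChunk(chunk)) {
      emitted.push(pair);
    }
  }
  const groups = shuffle(emitted);
  const totals: Record<string, number> = Object.create(null);
  for (const key of Object.keys(groups)) {
    totals[key] = reduceGroup(groups[key] || []);
  }
  return totals;
}
```

Calling `wordCount(["the cat", "The dog and the cat"])` counts "the" three times and "cat" twice. Because each `mapChunk` call depends only on its own chunk, the calls can be handed to separate machines without coordination, which is the property that makes the model attractive for large clusters.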

Recently at least two approaches have been discussed for enabling MapReduce applications to be run across browsers using JavaScript. These approaches help lower the barriers for enrolling new users in a particular computation while using MapReduce to simplify the programming model required to execute the computation.

Sean McCullough was the first to mention the general concept in January. He describes a basic technique to write programs in the MapReduce style, but does not elaborate on how to distribute and reassemble a computation across a group of machines.

In March, Ilya Grigorik elaborated on a more comprehensive approach for creating a distributed computing application using MapReduce running across JavaScript-enabled browsers. He noted, "It just so happens that there are more JavaScript processors around the world (every browser can run it) than for any other language out there - a perfect data processing platform." With a low barrier to entry, he believes it would be possible to enroll millions of users to solve a whole class of problems that were previously out of reach.

Although the technique has considerable promise, it also faces numerous challenges around security, economics and speed, McCullough later wrote. Workers could intentionally poison the jobs if they have an incentive to do so. How do you know if you can trust a worker? In the case of SETI@home, some individuals gamed the system so they could rank higher and gain more status in the quest for extraterrestrials.
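One common way to reduce the risk of poisoned results in volunteer computing is to send the same job to several independent workers and accept an answer only when a majority agree. The sketch below illustrates that idea; it is not something McCullough or Grigorik proposes in the posts discussed here. The function name `majorityResult` is hypothetical, and the sketch assumes each worker's result has already been serialized to a string so that answers can be compared directly.

```typescript
// Return the answer reported by a strict majority of workers,
// or null when no answer reaches a majority.
function majorityResult(results: string[]): string | null {
  // Prototype-less object so arbitrary result strings are safe as keys.
  const counts: Record<string, number> = Object.create(null);
  let best: string | null = null;
  let bestCount = 0;
  for (const r of results) {
    const n = (counts[r] || 0) + 1;
    counts[r] = n;
    if (n > bestCount) {
      best = r;
      bestCount = n;
    }
  }
  // Strict majority: more than half of all submitted results must match.
  return bestCount * 2 > results.length ? best : null;
}
```

With three workers, one dishonest or faulty result is outvoted by the other two; if all three disagree, the function returns `null` and the job would have to be reissued. The trade-off is that every job is computed several times, which multiplies the bandwidth and CPU cost.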

McCullough also points out that the large data sets in a MapReduce operation will cost significant amounts of money to move back and forth across the Internet, and even if the bandwidth costs are not an issue, the run times will be adversely affected by the delays over the Internet. He notes, "So until someone does a lot of legwork to sort out the basic m/r infrastructure and then tackles the additional problems introduced by running on an open, slow, expensive network connection, the JavaScript MapReduce over HTTP idea is just a (admittedly fun) toy."

In a later post, Grigorik noted that the subject of performance and scalability generated a lot of conversation. There are concerns that the job servers are a single point of failure, and that the stateless nature of HTTP introduces the need for a storage layer. However, he believes that many of these problems have already been addressed in the P2P community with protocols such as BitTorrent and Distributed Hash Tables.
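The role of the job server, and the reason it needs to keep state, can be sketched with a small in-memory model. This is a hypothetical illustration rather than Grigorik's design: the `JobServer` class, its method names and the `countWords` map function are all assumptions, the HTTP layer is left out entirely, and a real server would keep this state in a storage layer instead of in memory.

```typescript
interface Job {
  id: number;
  input: string;
}

class JobServer {
  private pending: Job[] = [];
  private outstanding: Record<number, Job> = {};
  private results: Record<number, number> = {};

  constructor(inputs: string[]) {
    inputs.forEach((input, id) => this.pending.push({ id, input }));
  }

  // Called when a browser asks for work; returns null when nothing is queued.
  nextJob(): Job | null {
    const job = this.pending.shift();
    if (job === undefined) return null;
    this.outstanding[job.id] = job;
    return job;
  }

  // Called when a browser posts a result; unknown or duplicate ids are ignored.
  submit(id: number, result: number): void {
    if (!(id in this.outstanding)) return;
    delete this.outstanding[id];
    this.results[id] = result;
  }

  // Put handed-out jobs back on the queue, e.g. after a worker's tab closed.
  requeueOutstanding(): void {
    for (const key of Object.keys(this.outstanding)) {
      const job = this.outstanding[Number(key)];
      if (job !== undefined) this.pending.push(job);
    }
    this.outstanding = {};
  }

  isComplete(): boolean {
    return this.pending.length === 0 && Object.keys(this.outstanding).length === 0;
  }

  // Reduce step: sum the per-job results.
  reduce(): number {
    let total = 0;
    for (const key of Object.keys(this.results)) {
      total += this.results[Number(key)] || 0;
    }
    return total;
  }
}

// The "map" work a browser would do: count the words in its chunk.
function countWords(input: string): number {
  return input.split(/\s+/).filter((w) => w.length > 0).length;
}

// Simulate one worker that disappears and another that finishes every job.
function runDemo(): number {
  const server = new JobServer(["a b c", "d e", "f"]);
  server.nextJob(); // a worker takes a job and then vanishes
  server.requeueOutstanding(); // the server reclaims the abandoned job
  let job = server.nextJob();
  while (job !== null) {
    server.submit(job.id, countWords(job.input));
    job = server.nextJob();
  }
  return server.reduce();
}
```

The `pending`, `outstanding` and `results` tables are exactly the state that HTTP itself does not carry between requests. Holding them in a single process is also what makes that process a single point of failure, which is the concern raised in the discussion.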

Others are concerned about the speed of JavaScript. But Grigorik notes that the extremely low barrier to entry and wide availability of browsers will enable the dawn of Swarm Computing in which ad-hoc swarms of computers can descend on a problem and contribute thousands of CPU hours in a matter of minutes, and then vanish a few seconds later.

A number of vendors such as 80Legs and PluraProcessing are starting to commercialize massive distributed computing to provide a low-cost computing model, although not using MapReduce or JavaScript. PluraProcessing has developed distributed computing code that game developers can incorporate into Java-based games. It pays developers $2.60 for each full month of CPU time provided. Organizations can lease this power for about one-tenth the cost of traditional cloud-based services or grid compute hardware.


This was first published in May 2009
