The Amazon Elastic MapReduce service is an implementation of Hadoop on top of the AWS platform. It was created to simplify the rollout of new MapReduce applications and thus make this technology available to a larger audience. Elastic MapReduce enables more people to run, monitor, and control Hadoop jobs by using a point-and-click interface.
Under the hood
An Elastic MapReduce cluster consists of a single master node and multiple slave nodes that execute the map and reduce steps. There are two types of slave nodes. Core nodes run tasks and also store data in the Hadoop Distributed File System (HDFS). Task nodes only run tasks and hold no HDFS data.
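To make the map and reduce steps concrete, here is a minimal single-machine sketch of a word count in Python. It only imitates what the slave nodes do in parallel; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative and are not part of Hadoop or Elastic MapReduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster, each input split would be mapped on a different node and the grouped keys would be distributed across reducers; the logic of each step stays the same.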
Amazon has recently added the ability to adjust the number of servers in a running Elastic MapReduce job flow on the fly. Once a job flow has started, you can increase but not reduce the number of core nodes, because core nodes hold data in the distributed file system. You can dynamically increase or decrease the number of task nodes as required.
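The resizing rule can be expressed as a small check that a script might run before requesting a change. This is a sketch of the rule as described here, not a call into the AWS API; the function name validate_resize and its parameters are hypothetical.

```python
def validate_resize(current_core, new_core, new_task):
    """Return the requested (core, task) counts if the change is allowed.

    Core nodes may only grow while a job flow is running.
    Task nodes may grow or shrink, down to zero.
    """
    if new_core < current_core:
        raise ValueError("core nodes can be added but not removed from a running job flow")
    if new_task < 0:
        raise ValueError("task node count cannot be negative")
    return new_core, new_task
```

For example, validate_resize(4, 6, 0) accepts growing from four to six core nodes while dropping all task nodes, whereas validate_resize(4, 3, 2) raises a ValueError.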
Programming the MapReduce workflow
Developers can interact with Elastic MapReduce via command-line tools, the API, or the AWS Management Console. The API and command-line tools allow the most automation and fine-grained control. They can be used to create custom job flow or monitoring steps. The web console is better suited for watching the progress of a job, or for launching or stopping a job flow from a web browser.
There are a variety of tools to help debug new MapReduce job flows. The debug job flow window can be accessed via the AWS Management Console and used to track progress and identify issues. You can also SSH into the master node and use your favorite command-line debugger to analyze the job flow. During the development phase, you will want to enable debugging by setting the "Enable Debugging" flag when you create a new job flow with the AWS Management Console. In command-line mode, pass the "--enable-debugging" and "--log-uri" options when the job flow is created.
One of the biggest challenges with MapReduce is the limited support for legacy code and programming methodologies. Amund Tveit said that developers can use Boto, a Python library for AWS, to simplify the integration of Elastic MapReduce with other web services. He said this kind of integration opens new doors for parallelizing legacy code.
New MapReduce instances are slow to boot up, noted Joel Duffin. The start-up time can be significantly reduced by keeping the cluster's EC2 instances running between jobs. This can save time during the development cycle, when new instances are repeatedly kicked off, explained Duffin. To avoid that start-up time, keep the job flow alive by creating it with the --alive option, as in: elastic-mapreduce --create --alive --log-uri s3://my-example-bucket/logs.
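For scripted development runs, the command line can be assembled in Python before being handed to a shell or to subprocess. The sketch below assumes the double-dash spelling of the options named in this article (--create, --alive, --enable-debugging, --log-uri); the helper name build_create_command is hypothetical, and the bucket is the example bucket used above.

```python
import shlex


def build_create_command(log_uri, keep_alive=False, enable_debugging=False):
    """Build the argument list for creating a job flow with the command-line tool.

    Only options mentioned in the article are used. The log URI is always
    passed because debugging relies on the logs written there.
    """
    args = ["elastic-mapreduce", "--create"]
    if keep_alive:
        args.append("--alive")  # keep the cluster running after its steps finish
    if enable_debugging:
        args.append("--enable-debugging")
    args += ["--log-uri", log_uri]
    return args


command = build_create_command(
    "s3://my-example-bucket/logs", keep_alive=True, enable_debugging=True
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed directly to subprocess.run without further quoting. A job flow created this way keeps running until it is explicitly terminated.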
This was first published in October 2010