Tip

Getting the most from Elastic MapReduce

MapReduce is a technique pioneered by Google for distributing applications across clusters of commodity hardware. It's gaining popularity for its ability to process massive log files. The Hadoop implementation of MapReduce is being used to process petabytes of data. Researchers believe that it also promises to enable a new paradigm for programming analytic models. MapReduce applications are being used for web indexing, data mining, log file analysis, financial analysis, scientific simulation and bioinformatics research.

The Amazon Elastic MapReduce service is an implementation of Hadoop on top of the AWS platform. It was created to simplify the rollout of new MapReduce applications and thus make this technology available to a larger audience. Elastic MapReduce enables more people to run, monitor, and control Hadoop jobs by using a point-and-click interface.

Under the hood

An Elastic MapReduce instance consists of a single master node and multiple slave nodes that execute the mapping and reducing algorithm. There are two types of slave nodes. Core nodes run tasks and also store data in the Hadoop Distributed File System (HDFS). Task nodes only execute tasks and do not store data.
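
The mapping and reducing algorithm that these nodes execute can be sketched in a single Python process. This is only an illustration of the technique, not how Hadoop or Elastic MapReduce is implemented: the function names, the two-field log layout, and the sample data are all assumptions made for the example. In a real cluster the map and reduce calls would run in parallel on the slave nodes, and the framework would handle the grouping step.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status_code, 1) for one log line.

    Assumes a hypothetical layout of "<path> <status>" per line;
    lines that do not match are skipped.
    """
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample data for illustration only.
sample_log = [
    "/index.html 200",
    "/missing 404",
    "/index.html 200",
]

print(run_job(sample_log))  # prints {'200': 2, '404': 1}
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across many machines without the pieces needing to coordinate.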

Amazon has recently added the ability to adjust the number of servers in an Elastic MapReduce instance on the fly. Once a job flow has started, you can increase but not reduce the number of core nodes, because removing a core node risks losing the data it stores. You can dynamically increase or decrease the number of task nodes as required.

Changes to a workflow can be made through the Elastic MapReduce interface, the command line, or a Java SDK. For example, a predefined workflow in an application might reduce the number of task nodes as the application moves to a different task with lower processing needs. These same tools can also be used to start new slave nodes in the event of a failure.
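
The resizing rules described in this section can be captured in a small check that a workflow script might run before requesting a change. This is a minimal sketch under stated assumptions: `validate_resize` and its arguments are hypothetical names, not part of any AWS tool or SDK, and the function only encodes the rule that core nodes can grow but not shrink while task nodes can move in either direction.

```python
def validate_resize(group_type, current_count, requested_count):
    """Return True if a running job flow could accept the requested node count.

    Hypothetical helper for illustration; it does not call AWS.
    group_type is "core" or "task".
    """
    if requested_count < 0:
        return False
    if group_type == "core":
        # Core nodes hold file system data, so the count may only stay or grow.
        return requested_count >= current_count
    if group_type == "task":
        # Task nodes can be added or removed as processing needs change.
        return True
    raise ValueError("unknown group type: %r" % (group_type,))


# A workflow moving to a lighter step might shrink its task group.
print(validate_resize("task", 8, 2))
print(validate_resize("core", 4, 2))
```

Running such a check locally lets a script reject an invalid request before it reaches the service.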

Programming the MapReduce workflow

Developers can interact with Elastic MapReduce via command line tools, the API, or the AWS Management Console. The API and command line tools allow the most automation and fine-grained control, and can be used to create special job flows or monitoring steps. The web console is better suited to watching the progress of a job, or to launching and stopping a job flow from a web browser.

There are a variety of tools to help debug new MapReduce instances. The debug job flow window, which can be accessed via the AWS Management Console, can be used to track progress and identify issues. You can also SSH into the master node and use your favorite command line debugger to analyze the job flow. During the development phase, enable debugging by setting the "Enable Debugging" flag when you create a new job flow with the AWS Management Console. In command line mode, pass "--enable-debugging" and "--log-uri" when the job flow is created.

One of the biggest challenges with MapReduce is the limited support for legacy code and programming methodologies. Amund Tviet said that developers can use Boto, a Python library for AWS, to simplify the integration of Elastic MapReduce with other web services. He said this kind of integration opens new doors for parallelizing legacy code.

New MapReduce instances are slow to boot up, noted Joel Duffin. The startup time can be significantly reduced by keeping the EC2 instances running, which saves time during the development cycle when new job flows are repeatedly kicked off, explained Duffin. To avoid that startup time, keep the EC2 instances alive by creating the job flow with the alive option: elastic-mapreduce --create --alive --log-uri s3://my-example-bucket/logs
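
A development script that launches job flows repeatedly could assemble this command line rather than retyping it. The sketch below is an illustration only: `build_create_command` is a hypothetical helper, and it assumes the double-dash option spelling that the article uses for "--enable-debugging" and "--log-uri". It only builds and prints the argument list; it does not run the tool or contact AWS.

```python
import shlex


def build_create_command(log_uri, keep_alive=True, enable_debugging=False):
    """Build the argument list for creating a job flow from the command line.

    Hypothetical helper; the option names are taken from the article and
    assumed to use the double-dash form.
    """
    args = ["elastic-mapreduce", "--create"]
    if keep_alive:
        # Keep the instances running after the steps finish.
        args.append("--alive")
    if enable_debugging:
        args.append("--enable-debugging")
    args += ["--log-uri", log_uri]
    return args


command = build_create_command("s3://my-example-bucket/logs", enable_debugging=True)

# Quote each argument so the printed line is safe to paste into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

Returning a list rather than a single string means the result can be passed directly to a process launcher without further shell quoting.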

This was first published in October 2010
