
13 - Processing Large Data Sets in MapReduce

from Part III - Big Data Applications

Published online by Cambridge University Press:  18 May 2017

Zhu Han, University of Houston
Mingyi Hong, Iowa State University
Dan Wang, Hong Kong Polytechnic University

Summary

Introduction

The performance of a data-parallel system is primarily determined by the amount of data that needs to be processed. Because the data are processed by different machines in parallel, however, the completion time of an application job is determined by the machine that finishes last. Distributing the data load evenly across machines is therefore a grand challenge. In the default configuration of Hadoop MapReduce, this is done by applying a hash function in the shuffle phase, so that the intermediate results output by the map phase are distributed evenly to the machines in the reduce phase. This default configuration targets the “common” case. It is clearly not optimal for every scenario, and in many situations it performs poorly.
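The hash-based distribution described above can be illustrated with a minimal Python sketch. It is not Hadoop's actual partitioner code: the function names `hash_partition` and `reducer_loads` are hypothetical, and MD5 is used only as a convenient deterministic hash. It shows that when every key is equally frequent, taking the hash modulo the number of reducers spreads records roughly evenly.

```python
import hashlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Map an intermediate key to a reducer index (hash modulo reducer count)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def reducer_loads(keys, num_reducers: int) -> list:
    """Count how many records each reducer receives under hash partitioning."""
    loads = [0] * num_reducers
    for key in keys:
        loads[hash_partition(key, num_reducers)] += 1
    return loads


if __name__ == "__main__":
    # Assumed workload: 10,000 distinct keys, each appearing exactly once.
    keys = [f"key-{i}" for i in range(10_000)]
    print(reducer_loads(keys, 4))
```

The balance holds only because every key carries the same weight. Once a few keys dominate, all of their records land on the same reducer, which is the case the default configuration handles poorly.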

There are many research studies on this grand problem, and they fall into reactive and proactive approaches. In the reactive approach, the workloads of different machines are monitored, and work can be migrated from one machine to another when the workloads become heavily skewed. SkewTune is a good example of the reactive approach [22]. In the proactive approach, the system tries to estimate the workload that the data to be processed will generate on each machine, so that the data can be dispatched to different machines in a load-aware manner.

In this chapter, we study a proactive approach that uses sublinear algorithms to improve the performance of processing large data sets in MapReduce.
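This summary does not spell out the chapter's algorithm, so the following Python sketch is only a generic illustration of the proactive idea, under stated assumptions: key frequencies are estimated from a random sample (reading far fewer records than the full input), and keys are then assigned greedily to the currently least-loaded reducer. The names `estimate_key_counts` and `load_aware_assignment`, the sample size, and the greedy rule are all hypothetical choices, not the authors' method.

```python
import heapq
import random
from collections import Counter


def estimate_key_counts(records, sample_size: int, seed: int = 0) -> dict:
    """Estimate per-key record counts from a uniform random sample."""
    if not records:
        return {}
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    scale = len(records) / len(sample)  # scale sample counts up to full size
    return {key: count * scale for key, count in Counter(sample).items()}


def load_aware_assignment(estimated_counts: dict, num_reducers: int) -> dict:
    """Assign keys, heaviest first, to the reducer with the least estimated load."""
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(estimated_counts.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


if __name__ == "__main__":
    # Assumed toy input: one hot key and three lighter keys.
    records = ["a"] * 600 + ["b"] * 200 + ["c"] * 100 + ["d"] * 100
    estimates = estimate_key_counts(records, sample_size=200)
    print(load_aware_assignment(estimates, num_reducers=2))
```

Keys that never appear in the sample receive no assignment here; a practical design would need a fallback for them, for example plain hash partitioning.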

The Data Skew Problem of MapReduce Jobs

Most studies of MapReduce jobs have assumed that the input data are uniformly distributed; hashing such data to the reduce worker nodes naturally leads to a desirably balanced load in the later stages. Real-world data, however, are not necessarily uniform, and often exhibit a remarkable degree of skewness. For example, in PageRank, the graph commonly includes nodes with far more incoming edges than others [21], and in Inverted Index, certain content can appear in many more documents than others [22]. Such a skewed distribution of the input or intermediate data can cause a small number of mappers or reducers to take significantly longer to complete than the others [19]. Recent experimental studies [21, 19, 22] have shown that, in the CloudBurst application with a biology data set of a bimodal distribution [635], the slowest map task takes five times as long to complete as the fastest.
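The effect of skew on reducer load can be illustrated with a small Python sketch. The key-frequency profiles below are invented for illustration (a flat profile and a Zipf-like one) and are not data from the cited studies; the helper names are hypothetical. The sketch partitions keys by hash and reports the ratio of the heaviest reducer's load to the mean load, since the heaviest reducer bounds the job's completion time.

```python
import hashlib


def reducer_of(key: str, num_reducers: int) -> int:
    """Hash partitioning: reducer index = hash(key) mod number of reducers."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers


def loads_from_counts(key_counts: dict, num_reducers: int) -> list:
    """Total records per reducer, given how many records each key has."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[reducer_of(key, num_reducers)] += count
    return loads


def imbalance(loads: list) -> float:
    """Heaviest load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    # Assumed profiles: every key equally frequent vs. counts falling off as 1/i.
    uniform = {f"k{i}": 10 for i in range(1, 1001)}
    skewed = {f"k{i}": 10_000 // i for i in range(1, 1001)}
    for name, counts in (("uniform", uniform), ("skewed", skewed)):
        loads = loads_from_counts(counts, 4)
        print(name, loads, round(imbalance(loads), 2))
```

All records of a key go to the same reducer, so a single very frequent key sets a floor on the heaviest load no matter how well the hash spreads the remaining keys.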

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2017
