Processing Large Data Sets in MapReduce

Zhu Han; Mingyi Hong; Dan Wang

doi:10.1017/9781316408032.013

13 - Processing Large Data Sets in MapReduce

from Part III - Big Data Applications

Published online by Cambridge University Press: 18 May 2017


Affiliations:
Zhu Han: University of Houston
Mingyi Hong: Iowa State University
Dan Wang: Hong Kong Polytechnic University


Summary

Introduction

The performance of a data-parallel structure is primarily determined by the amount of data to be processed. However, because the data are processed by different machines in parallel, the completion time of an application job is determined by the machine that finishes last. Distributing the data load evenly across machines is therefore a grand challenge. In the default configuration of Hadoop MapReduce, this is done by applying a hash function in the shuffle phase, so that the intermediate results output by the map phase are distributed evenly to the machines in the reduce phase. This default configuration emphasizes the “common” case. It is clearly not optimal for every scenario, and in many situations it performs poorly.
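
The following minimal Python sketch illustrates the two ideas in the paragraph above: hash-based assignment of intermediate keys to reducers, and a job completion time set by the most heavily loaded reducer. It is not Hadoop code. The function names, the use of `zlib.crc32` as a stand-in for the key hash, and the unit cost per record are all assumptions made for illustration.

```python
import zlib

def hash_partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index by hashing (crc32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(records, num_reducers):
    """Count how many (key, value) records each reducer receives."""
    loads = [0] * num_reducers
    for key, _value in records:
        loads[hash_partition(key, num_reducers)] += 1
    return loads

def completion_time(loads, cost_per_record=1.0):
    """The job finishes when the most loaded reducer finishes (assumed linear cost)."""
    return max(loads) * cost_per_record
```

Because every record with the same key goes to the same reducer, a single very frequent key sets a lower bound on the completion time regardless of how well the hash spreads the remaining keys.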

There are many research studies on this problem, and they fall into two approaches: reactive and proactive. In the reactive approach, the workloads of different machines are monitored, and work is migrated from one machine to another when the workloads become heavily skewed. SkewTune is a good example of the reactive approach [22]. In the proactive approach, the system tries to estimate the workload that the data to be processed will generate on each machine, so that the data can be dispatched to different machines in a load-aware manner.
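
As a rough illustration of what a proactive, load-aware dispatch could look like, the sketch below estimates per-key loads from a random sample and then assigns keys greedily, heaviest first, to the currently least loaded reducer. This is a generic heuristic chosen for illustration and is not claimed to be the algorithm developed in the chapter. The helper names, the sampling scheme, and the assumption that load is proportional to record count are all hypothetical.

```python
import heapq
import random
from collections import Counter

def estimate_key_counts(keys, sample_size, seed=0):
    """Estimate per-key record counts from a uniform sample (hypothetical helper).

    Keys that never appear in the sample are absent from the estimate and
    would need a fallback, such as plain hashing.
    """
    if not keys:
        return {}
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    scale = len(keys) / len(sample)
    return {k: c * scale for k, c in Counter(sample).items()}

def load_aware_assignment(key_loads, num_reducers):
    """Greedy assignment: heaviest key first, onto the least loaded reducer."""
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load, breaking ties by key for determinism.
    for key, load in sorted(key_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        current, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (current + load, reducer))
    loads = [0.0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_loads[key]
    return assignment, loads
```

The sample touches only part of the input, so the estimate is cheap but approximate; the quality of the resulting balance depends on how well the sample captures the heavy keys.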

In this chapter, we study a proactive approach that uses sublinear algorithms to improve the performance of processing large data sets in MapReduce.

The Data Skew Problem of MapReduce Jobs

Most studies of MapReduce jobs have assumed that the input data are uniformly distributed; such data, when hashed to the reduce worker nodes, naturally lead to a desirably balanced load in the later stages. Real-world data, however, are not necessarily uniform and often exhibit a remarkable degree of skew. For example, in PageRank, the graph commonly includes nodes with a much higher number of incoming edges than others [21], and in Inverted Index, certain content can appear in many more documents than others [22]. Such a skewed distribution of the input or intermediate data can result in a small number of mappers or reducers taking significantly longer to complete than the others [19]. Recent experimental studies [21, 19, 22] have shown that, in the CloudBurst application with a biology data set of a bimodal distribution [635], the slowest map task takes five times as long to complete as the fastest.
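
The inverted-index example can be made concrete with a toy sketch. The code below builds posting lists for a few made-up documents and reports the ratio between the largest and smallest posting list, the same kind of slowest-to-fastest ratio quoted above. The documents, the function names, and the use of posting-list size as a proxy for task cost are assumptions for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def skew_ratio(costs):
    """Ratio of the largest to the smallest cost (costs assumed positive)."""
    costs = list(costs)
    return max(costs) / min(costs)

# Toy corpus: "the" appears in every document, most other terms in one.
docs = {
    "d1": "the cat sat",
    "d2": "the dog ran",
    "d3": "the bird flew",
    "d4": "the cat ran",
}
index = build_inverted_index(docs)
posting_sizes = {term: len(ids) for term, ids in index.items()}
print(skew_ratio(posting_sizes.values()))  # 4.0 for this corpus
```

With task durations of 10, 12, and 50 time units, `skew_ratio([10, 12, 50])` gives 5.0, matching a slowest task that takes five times as long as the fastest.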

Information

Type: Chapter
Information: Signal Processing and Networking for Big Data Applications, pp. 283-300

DOI: https://doi.org/10.1017/9781316408032.013

Publisher: Cambridge University Press

Print publication year: 2017
