Distributed Computing with MapReduce and Pig

Serge Abiteboul; Ioana Manolescu; Philippe Rigaux; Marie-Christine Rousset; Pierre Senellart

doi:10.1017/CBO9780511998225.017

16 - Distributed Computing with MapReduce and Pig

from Part 3 - Building Web Scale Applications

Published online by Cambridge University Press: 05 June 2012

Serge Abiteboul: INRIA Saclay – Île-de-France
Ioana Manolescu: INRIA Saclay – Île-de-France
Philippe Rigaux: Conservatoire National des Arts et Métiers, Paris
Marie-Christine Rousset: Université de Grenoble, France
Pierre Senellart: Télécom ParisTech, France

Summary

So far, the discussion of distributed systems has been limited to data storage and to a few data management primitives (e.g., write(), read(), and search()). For real applications, one also needs to develop and execute more complex programs that process the available data sets and effectively exploit the available resources.

The naive approach, which consists in fetching all the required data to the Client in order to process it locally, often performs poorly in a distributed setting. First, some of the processing may not be available locally. Moreover, centralizing all the information and then processing it would simply forgo all the advantages of a powerful cluster of hundreds or even thousands of machines. We have to use distribution. One can consider two main scenarios for data processing in distributed systems.

Distributed processing and workflow: In the first scenario, an application has large data sets at its disposal and needs to apply to them some processes that are available on remote sites. The problem is then to send the data to the appropriate locations and to sequence the remote executions. This workflow scenario is typically implemented using Web services and some high-level coordination language.

Distributed data and MapReduce: In the second scenario, the data sets are already distributed over a number of servers and, in contrast to the previous scenario, we “push” programs to these servers. Indeed, because network bandwidth is limited, it is often more cost-effective to send a small program from the Client to the Servers than to transfer large data volumes to a single Client.
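
The second scenario can be illustrated with a minimal, single-process sketch. It is not taken from the chapter and uses no real MapReduce framework: the list PARTITIONS stands in for data already spread over three servers, and the names count_words, run_on_servers, and merge are hypothetical helpers introduced only for this illustration. The point is that a small program is applied where each partition lives, and only compact partial results travel back to the Client.

```python
from collections import Counter
from functools import reduce

# Assumption: each inner list plays the role of the data held by one server.
PARTITIONS = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
    ["quick quick fox"],
]


def count_words(lines):
    """The small program 'pushed' to a server: count words in its local lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def run_on_servers(program, partitions):
    """Simulate running `program` on every server, next to its own data.

    In a real cluster this would be a remote invocation; here it is a plain
    local call, so only the idea (not the networking) is shown.
    """
    return [program(partition) for partition in partitions]


def merge(partial_counts):
    """Combine the small per-server results on the Client."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


partials = run_on_servers(count_words, PARTITIONS)
total = merge(partials)
print(total["the"], total["quick"], total["fox"])  # prints: 3 3 2
```

Only the per-server counters cross the (simulated) network, rather than the raw lines, which is the bandwidth argument made above.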

Information

Type: Chapter
Information: Web Data Management, pp. 339-363

DOI: https://doi.org/10.1017/CBO9780511998225.017

Publisher: Cambridge University Press

Print publication year: 2011
