Large-Scale Data Analytics with Python and Spark: A Hands-on Guide to Implementing Machine Learning Solutions

Isaac Triguero; Mikel Galar

doi:10.1017/9781009318242

Chapter 2: MapReduce

pp. 19-42

Isaac Triguero, University of Nottingham
Mikel Galar, Public University of Navarre


Extract

MapReduce is a parallel programming model that follows a simple divide-and-conquer strategy to tackle big datasets in distributed computing. This chapter begins with a discussion of the key distinguishing features of MapReduce and how it differs from similar distributed computing tools such as the Message Passing Interface (MPI). Then, we introduce its two main functions, map and reduce, which are based on functional programming. After that, we present how MapReduce works using the classical WordCount example, the Hello World of big data, and discuss different ways to parallelize it together with their main advantages and disadvantages. Next, we describe MapReduce and its functions more formally in terms of key–value pairs, as well as the key properties of the map, shuffle, and reduce operations. At the end of the chapter, we cover some important details: how MapReduce achieves fault tolerance, how it preserves data locality, how it can reduce data transfer across computers using combiners, and additional information about its internal workings.
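The map, shuffle, and reduce steps named above can be illustrated with a minimal WordCount sketch in plain Python. This is not code from the chapter: the function names (map_phase, combine, shuffle, reduce_phase, word_count) and the sample sentences are assumptions made for illustration, and each inner list of lines merely simulates a data block held by one worker rather than a real distributed partition.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate counts locally, before any data is moved."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(partitions):
    # Each partition is mapped and combined independently (hypothetical workers).
    combined = [
        combine(chain.from_iterable(map_phase(line) for line in part))
        for part in partitions
    ]
    # Only the combined pairs are shuffled, then reduced per key.
    grouped = shuffle(chain.from_iterable(combined))
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Illustrative input: two simulated partitions of text lines.
partitions = [
    ["big data is big", "data is everywhere"],
    ["spark handles big data"],
]
counts = word_count(partitions)
print(counts)
```

In this sketch the first partition maps to seven (word, 1) pairs, but the combiner collapses them to four before the shuffle, which is the kind of saving in data transfer that combiners are meant to provide.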

Keywords

MapReduce
functional programming
fault tolerance
map
shuffle
reduce
combiners
MPI
WordCount
key–value pairs

About the book

Chapter DOI https://doi.org/10.1017/9781009318242.003
Book DOI https://doi.org/10.1017/9781009318242
Subjects Computer Science; Data Science; Databases, Data Mining, and Information Retrieval; Machine Learning and Pattern Recognition
Format: Paperback
- Publication date: 08 February 2024
- ISBN: 9781009318259
Format: Digital
- Publication date: 15 December 2023
- ISBN: 9781009318242
