This chapter introduces Spark, a data processing engine that overcomes some of the limitations of Hadoop MapReduce to perform data analytics efficiently. We begin with the motivation for Spark, introducing the Spark RDD as an in-memory distributed data structure that enables faster processing while retaining Hadoop's attractive properties, such as fault tolerance. We then cover, hands-on, how to create and operate on RDDs, distinguishing between transformations and actions. Furthermore, we discuss how to work with key–value RDDs (which most closely resemble the MapReduce model), how to use caching for iterative queries and operations, and how RDD lineage ensures fault tolerance. We provide a wide range of examples contrasting transformations such as map vs. flatMap and groupByKey vs. reduceByKey, discussing their behavior, their suitability for a given task, and their performance. More advanced concepts, such as shared variables (broadcast variables and accumulators) and per-partition operations, are presented towards the end. Finally, we describe the anatomy of a Spark application, the different types of dependencies (narrow vs. wide), and the limits they place on optimizing their processing.
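As a taste of the hands-on material, the following is a minimal, self-contained Scala sketch of the concepts listed above (the application name and sample data are illustrative assumptions, not taken from the chapter); it contrasts map with flatMap, builds a key–value RDD, and uses reduceByKey and caching:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Run locally with all cores; in a cluster the master URL would differ.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from a local collection. Transformations are lazy;
    // nothing executes until an action is called.
    val lines = sc.parallelize(Seq("spark makes rdds", "rdds are resilient"))

    // map produces exactly one output element per input element,
    // while flatMap may produce zero or more, flattening the result.
    val arraysPerLine = lines.map(_.split(" "))     // RDD[Array[String]]
    val words         = lines.flatMap(_.split(" ")) // RDD[String]

    // Key–value RDD: reduceByKey combines values per key within each
    // partition before shuffling, whereas groupByKey would ship every
    // value across the network first, which is why it is usually slower.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.cache()                    // keep in memory for repeated queries
    counts.collect().foreach(println) // action: triggers the whole lineage

    sc.stop()
  }
}
```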