Large-Scale Data Analytics with Python and Spark: A Hands-on Guide to Implementing Machine Learning Solutions

Isaac Triguero; Mikel Galar

doi:10.1017/9781009318242

Chapter 6: Machine Learning with Spark

pp. 177-211

Isaac Triguero, University of Nottingham

Mikel Galar, Public University of Navarre


Extract

This chapter introduces the machine learning part of the book. Although we assume some prior experience in machine learning, we begin with a full recap of the basic concepts and key terminology. This includes a discussion of learning paradigms, such as supervised and unsupervised learning, and of the machine learning life cycle, which covers the steps from data collection to model deployment. We cover data preparation and preprocessing, model evaluation and selection, and machine learning pipelines, showing how every stage of this cycle can be compromised in large-scale data analytics. The rest of the chapter is devoted to Spark's machine learning library, MLlib. Basic concepts such as Transformers, Estimators, and Pipelines are presented through an example using linear regression. The example requires a pipeline of methods to get the data ready for training, which allows us to introduce some of Spark's data preparation components (e.g., VectorAssembler and StandardScaler). Finally, we explore evaluation components (e.g., RegressionEvaluator) and how to perform hyperparameter tuning.

Keywords

Machine learning
MLlib
Transformer
Estimator
Pipeline
Machine learning life cycle
Hyperparameter tuning

About the book

Chapter DOI: https://doi.org/10.1017/9781009318242.007
Book DOI: https://doi.org/10.1017/9781009318242
Subjects: Computer Science; Data Science, Databases, Data Mining, and Information Retrieval; Machine Learning and Pattern Recognition
Format: Paperback
- Publication date: 08 February 2024
- ISBN: 9781009318259
Format: Digital
- Publication date: 15 December 2023
- ISBN: 9781009318242
