This chapter puts forward new guidelines for designing and implementing distributed machine learning algorithms for big data. First, we present two alternative strategies, which we call local and global approaches. To show how these two strategies work, we focus on the classical decision tree algorithm, reviewing how it operates and which details must be modified to handle large datasets. We implement a local-based solution for decision trees, comparing its behavior and efficiency against a sequential model and the MLlib version. We also discuss the implementation details of decision trees in MLlib as a representative example of a global solution. This allows us to formally define the two concepts and discuss their key (expected) advantages and disadvantages. The second part focuses on measuring the scalability of a big data solution. We present three classical metrics, speed-up, size-up, and scale-up, that help determine whether a distributed solution is scalable. Using these metrics, we evaluate our local-based approach and compare it against its global counterpart. This experiment allows us to offer practical tips for computing these metrics correctly on a Spark cluster.
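The three scalability metrics mentioned above can be sketched as simple ratios of wall-clock times. The functions below follow the standard textbook definitions; the function names and the example timings are illustrative, not taken from the chapter's experiments.

```python
def speed_up(t_one_node: float, t_p_nodes: float) -> float:
    """Speed-up: same dataset, p times the nodes. Ideal value: p."""
    return t_one_node / t_p_nodes


def size_up(t_base: float, t_scaled: float) -> float:
    """Size-up: same nodes, m times the data. Ideal value: m."""
    return t_scaled / t_base


def scale_up(t_base: float, t_scaled: float) -> float:
    """Scale-up: data and nodes grow by the same factor. Ideal value: 1."""
    return t_base / t_scaled


# Hypothetical timings (seconds), e.g. measured on a Spark cluster.
print(speed_up(100.0, 25.0))   # 4x nodes cut the time to a quarter -> 4.0
print(size_up(50.0, 150.0))    # 3x data tripled the time -> 3.0
print(scale_up(60.0, 66.0))    # near-constant time under joint growth
```

In practice the timings would come from repeated runs of the distributed algorithm with the cluster size and dataset size varied as each metric requires.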