This chapter introduces what we mean by big data, why it matters, and the key principles for handling it efficiently. The term "big data" is frequently used to refer to the idea of exploiting large amounts of data to obtain some benefit, but there is no standard definition; it is also commonly known as large-scale data processing or data-intensive computing. We discuss the key components of big data, showing that it is not all about volume: other aspects, such as velocity and variety, must also be considered. The world of big data has many facets, including databases, infrastructure, and security, but here we focus on data analytics. We then cover how to deal with big data, explaining why we cannot simply scale up a single computer but must instead scale out and distribute the processing across multiple machines. We also explain why traditional high-performance computing clusters are not appropriate for data-intensive applications, since shipping the data to the compute nodes would saturate the network. Finally, we introduce the key features of a big data cluster: it is not devoted to pure computation, it does not require high-end computers, it needs fault tolerance mechanisms, and it respects the principle of data locality.
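To make the scale-out idea concrete, below is a minimal, hypothetical Python sketch (not taken from the chapter) of the pattern that underlies it: the input is split into partitions, independent workers process each partition in parallel, and the partial results are merged at the end. Real big data frameworks build fault tolerance and data locality on top of this same pattern.

```python
# Hypothetical illustration of scale-out processing: a word count where
# each worker handles one partition of the data independently, and the
# partial counts are merged afterwards.
from collections import Counter
from multiprocessing import Pool

def count_words(partition):
    """Process one partition of the data on its own (the 'map' step)."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    data = [
        "big data is not all about volume",
        "velocity and variety also matter",
        "scale out across many machines",
    ]
    # Split the dataset into partitions, one per worker process.
    partitions = [data[0:1], data[1:2], data[2:3]]
    with Pool(processes=3) as pool:
        partials = pool.map(count_words, partitions)
    # Merge the partial results (the 'reduce' step).
    total = sum(partials, Counter())
    print(total.most_common(3))
```

Here the workers are processes on one machine, but the same structure applies when the partitions live on different machines; data locality then means scheduling each worker on the node that already stores its partition, rather than moving the data over the network.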