Hadoop is an open-source framework, written in Java, for big data processing and storage, based on the MapReduce programming model. This chapter begins with a brief introduction to Hadoop and how it has evolved into a solid base platform for most big data frameworks. We show how to implement the classical Word Count example using Hadoop MapReduce, highlighting the difficulties involved. We then explain how its resource negotiator, YARN, and its distributed file system, HDFS, work. We describe step by step how a MapReduce job is executed on YARN, introducing the concepts of resource and node managers, application master, and containers, as well as the different execution modes (standalone, pseudo-distributed, and fully distributed). Likewise, we discuss HDFS, covering the basic design of this file system and its implications for functionality and efficiency. We also cover recent advances such as erasure coding, HDFS federation, and high availability. Finally, we examine the main limitations of Hadoop and how they have sparked the rise of many new big data frameworks, which now coexist within the Hadoop ecosystem.
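As a preview of the chapter's opening example, the sketch below shows a minimal Word Count job written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the chapter's own implementation may differ in detail. The mapper emits a (word, 1) pair for every token, the reducer is reused as a combiner to pre-aggregate map output locally, and the reducer sums the counts per word. Input and output paths are taken from the command line. The amount of boilerplate required for such a simple aggregation illustrates the difficulties the chapter highlights.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the partial counts received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Reusing the reducer as a combiner reduces data shuffled over the network.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would typically be packaged as a JAR and submitted to the cluster with, for example, hadoop jar wordcount.jar WordCount /input /output, where /input and /output are illustrative HDFS paths.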