Spark SQL is a Spark module for structured data processing that builds on and improves upon RDDs. This chapter explains how imposing structure on the data lets Spark perform further optimizations. We discuss the transition from RDDs to DataFrames and Datasets, including a brief description of the Catalyst and Tungsten projects. Because the Dataset API is not available in Python, we focus on DataFrames. Following a learn-by-example approach, we show how to create DataFrames, inferring the schema automatically or defining it manually, and how to operate on them. These operations usually feel more natural to SQL developers, but the API can be used in either an object-oriented programming style or plain SQL. As we did with RDDs, we work through various examples to demonstrate different DataFrame operations. Starting from standard transformations such as select and filter, we move on to more specialized operations such as Column transformations, and show how to perform efficient aggregations using built-in Spark functions. As advanced content, we cover implementing user-defined functions for DataFrames, as well as an introduction to pandas-on-Spark, a powerful API for programmers more familiar with pandas.