Chapter 5: Spark SQL and DataFrames

pp. 109-174

Authors

, University of Nottingham, , Public University of Navarre

Extract

Spark SQL is the Spark module for structured data processing, and it improves upon RDDs. The chapter explains how imposing structure on the data lets Spark perform further optimizations. We discuss the transition from RDDs to DataFrames and Datasets, including a brief description of the Catalyst and Tungsten projects. Datasets are not available in Python, so we focus on DataFrames. With a learn-by-example approach, we see how to create DataFrames, with the schema either inferred automatically or specified manually, and how to operate on them. We show that these operations usually feel more natural to SQL developers, and that we can interact with this API either in an object-oriented programming style or in SQL. As we did with RDDs, we work through various examples that demonstrate how different DataFrame operations behave. Starting from standard transformations such as select or filter, we move on to more specialized operations such as Column transformations and to efficient aggregations using Spark functions. As advanced content, we cover implementing user-defined functions for DataFrames and give an introduction to pandas-on-Spark, a powerful API for programmers who are more used to pandas.

Keywords

  • DataFrame
  • Dataset
  • SQL
  • structured data
  • schema
  • Tungsten
  • Catalyst
  • UDF
  • pandas-on-Spark
  • explode
  • Column operations
