Chapter 4: Implementing Data Pre-processing in Python

pp. 155-186

Authors

, Washington State University, USA

Extract

Chapter Objectives

  • To understand the need for importing libraries such as NumPy, Pandas, Matplotlib, and scikit-learn.

  • To learn the steps to import a dataset.

  • To understand the process for handling missing values.

  • To discuss the steps for handling categorical data.

  • To understand the need for, and the process of, splitting the dataset into training and testing datasets.

  • To discuss the steps to perform feature scaling by using normalization and standardization.

Machine learning (ML) algorithms work on cleaned data. The data we collect for building ML models usually suffers from noise, missing values, inconsistent data types, and features measured on different scales. This makes pre-processing a very important phase in preparing the data for building ML models. Pre-processing means applying transformations to the data before feeding it to the ML algorithm. In short, data pre-processing refers to a set of procedures applied to the data to make it fit for ML algorithms. It generally involves the following steps:

  • Step 1. Importing libraries: The libraries required to carry out the subsequent data manipulation and cleaning tasks are imported.

  • Step 2. Loading the dataset: The dataset that needs to be pre-processed is loaded.

  • Step 3. Handling the missing values: Datasets often contain missing or null values; these values need to be handled appropriately.

  • Step 4. Handling the categorical data: Categorical attributes often contain multiple categories, so they need proper treatment and transformation during pre-processing.

  • Step 5. Splitting the dataset into training and testing datasets: A model must be evaluated on data it did not see during training, so we split the dataset into training and testing subsets before building the ML models.

  • Step 6. Feature scaling: Features in a dataset often span different ranges or scales. Feature scaling brings them to a comparable scale to ensure uniformity in results.
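The six steps above can be sketched end to end as follows. This is a minimal illustration, not the chapter's own code: the inlined dataset, its column names (`country`, `age`, `salary`, `purchased`), mean imputation, one-hot encoding through `pd.get_dummies`, the 80/20 split, and fitting the scalers on the training subset only are all assumptions made for the example.

```python
# Step 1: import the libraries used in the later steps.
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Step 2: load the dataset. The data is hypothetical and inlined so the
# sketch is self-contained; a real project would pass a file path to read_csv.
CSV_TEXT = """country,age,salary,purchased
France,44,72000,no
Spain,27,48000,yes
Germany,30,54000,no
Spain,38,61000,no
Germany,40,,yes
France,35,58000,yes
Spain,,52000,no
France,48,79000,yes
Germany,50,83000,no
France,37,67000,yes
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Step 3: handle missing values by replacing them with the column mean
# (one possible strategy; median or dropping rows are alternatives).
numeric_cols = ["age", "salary"]
imputer = SimpleImputer(strategy="mean")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Step 4: handle categorical data. One-hot encode the country column and
# map the yes/no target to 1/0.
X = pd.get_dummies(df[["country", "age", "salary"]], columns=["country"], dtype=float)
y = df["purchased"].map({"no": 0, "yes": 1})

# Step 5: split into training and testing subsets (80% / 20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 6a: standardization (zero mean, unit variance). The scaler is fitted
# on the training subset and then reused, unchanged, on the test subset.
scaler = StandardScaler()
X_train_std, X_test_std = X_train.copy(), X_test.copy()
X_train_std[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test_std[numeric_cols] = scaler.transform(X_test[numeric_cols])

# Step 6b: normalization (rescaling each feature to the range [0, 1]).
minmax = MinMaxScaler()
X_train_norm, X_test_norm = X_train.copy(), X_test.copy()
X_train_norm[numeric_cols] = minmax.fit_transform(X_train[numeric_cols])
X_test_norm[numeric_cols] = minmax.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape)
print(np.round(X_train_std[numeric_cols].mean().to_numpy(), 6))
```

Fitting the scalers after the split keeps information from the test subset out of the scaling parameters, which is one reason the split is listed before feature scaling.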

It is important to note that it is not necessary to apply all of these steps to pre-process the data. Depending on the nature of the dataset, some of these steps may be skipped; for example, a dataset with no missing values needs no missing-value handling. In the coming sections, we will discuss why each step is needed and how to perform it in Python.
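One way to decide which steps a particular dataset needs is to inspect it first. The helper below is a hypothetical sketch: the function name `steps_needed`, the rule that every non-numeric column counts as categorical, and the factor-of-10 threshold for "different scales" are assumptions chosen for illustration, not rules taken from the chapter.

```python
import numpy as np
import pandas as pd


def steps_needed(df: pd.DataFrame) -> dict:
    """Report which optional pre-processing steps a DataFrame appears to need."""
    numeric = df.select_dtypes(include="number")
    # Assumption: every non-numeric column is treated as categorical.
    categorical = df.select_dtypes(exclude="number")

    # Range (max - min) of each numeric column, ignoring missing values.
    ranges = numeric.max() - numeric.min()
    positive = ranges[ranges > 0]
    # Assumption: ranges differing by more than a factor of 10 suggest scaling.
    scales_differ = len(positive) > 1 and bool(positive.max() / positive.min() > 10)

    return {
        "handle_missing_values": bool(df.isna().any().any()),
        "handle_categorical_data": categorical.shape[1] > 0,
        "feature_scaling": scales_differ,
    }


# Hypothetical data: all numeric, complete, and on similar scales.
clean = pd.DataFrame({"height_m": [1.6, 1.7, 1.8], "width_m": [0.5, 0.6, 0.4]})

# Hypothetical data: a missing value, a text column, and very different ranges.
messy = pd.DataFrame(
    {
        "city": ["A", "B", None],
        "income": [30000.0, np.nan, 52000.0],
        "age": [25, 40, 31],
    }
)

print(steps_needed(clean))
print(steps_needed(messy))
```

For `clean`, every flag is false, so steps 3, 4, and 6 could be skipped; for `messy`, every flag is true.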
