Machine Learning with Python: Principles and Practical Techniques

Parteek Bhatia

doi:10.1017/9781009170239

Chapter 3: Data Pre-processing

pp. 125-154

Parteek Bhatia, Washington State University, USA

Chapter Objectives

• To understand the need for data pre-processing.
• To learn about different phases of data pre-processing like data cleaning, data integration, data transformation, and data reduction.
• To understand the need for feature scaling.
• To comprehend normalization and standardization techniques for feature scaling.
• To understand principal component analysis for feature extraction.
• To pre-process the categorical data for building machine learning models.

3.1 Need for Data Pre-processing

We live in an age where data is often called the new oil, because we need data to train machine learning (ML) algorithms. A data analyst's most important job is to collect, clean, and analyze data and then build ML models on the cleaned dataset. But the raw data that we obtain is often noisy. It contains many discrepancies and inconsistencies, and often has missing values. To understand this situation, let us consider an example.

Suppose we have to predict house prices, and for this, we have collected data from a few previous transactions, as shown in Figure 3.1.

In an ideal situation, the captured data would be in the format shown in Figure 3.1. Here, the size of the house and the number of bedrooms are the input features, while the price is the output attribute. We can predict the price of an unseen instance through regression.

In practice, however, the captured data is usually not of good quality, and we typically end up with a dataset like the one shown in Figure 3.2.
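As a minimal sketch of the regression idea described above, the following Python code fits a linear model to clean data using ordinary least squares. The transaction values are hypothetical (they are not the values from Figure 3.1), and the names `X`, `y`, and `predict_price` are chosen only for illustration.

```python
import numpy as np

# Hypothetical clean transactions (NOT the values from Figure 3.1).
# Columns: house size in square feet, number of bedrooms.
X = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 3.0],
    [2500.0, 4.0],
])
# Hypothetical prices, constructed for this sketch as
# 100 * size + 10000 * bedrooms + 50000.
y = np.array([170000.0, 230000.0, 280000.0, 340000.0])

# Append a column of ones so the model can learn an intercept term.
A = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: find coef that minimizes ||A @ coef - y||.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict_price(size, bedrooms):
    """Predict the price of an unseen house with the fitted linear model."""
    return float(coef[0] * size + coef[1] * bedrooms + coef[2])


predicted = predict_price(1800.0, 3.0)
print(round(predicted))
```

Because every row here is complete and consistent, the fit recovers the relationship used to build the data. This is the behavior that breaks down once values are missing or wrong.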

You can see that this data is messy. There are many unknown or missing values, and if we trained a model on this data, its predictions would be very poor. You can also identify noise and incorrect labels; for example, the price in the second record is incorrect and will result in poor model training.
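One way to quantify this kind of messiness is to count the missing values before any training takes place. The sketch below uses pandas on hypothetical records (not the values from Figure 3.2). The column names and the price threshold of 1000 are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical messy records (NOT the values from Figure 3.2).
# np.nan marks a missing value; the second price is deliberately implausible.
df = pd.DataFrame({
    "size_sqft": [1000.0, np.nan, 2000.0, 2500.0],
    "bedrooms": [2.0, 3.0, np.nan, 4.0],
    "price": [170000.0, 5.0, 280000.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of records that have at least one missing value.
rows_with_missing = int(df.isna().any(axis=1).sum())

# Records that are fully populated.
complete_rows = df.dropna()

# Flag prices below an assumed plausibility threshold (illustrative only).
suspicious = df[df["price"] < 1000]

print(missing_per_column)
print(rows_with_missing, len(complete_rows))
print(suspicious)
```

In this toy dataset only one of the four records is complete, which shows how quickly a few gaps can shrink the usable training data.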

We can consider some more examples. If someone entered -1 in the “salary credited” column of an employee dataset, the value would not make any sense and would be considered noise. Sometimes we may also have an unrealistic or impossible combination of values; for example, a record with Gender = Male and Pregnant = Yes.
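Checks like these can be written as simple validation rules. The following sketch, again using pandas, flags records with a negative salary or with the impossible Gender/Pregnant combination. The dataset and the column names (`gender`, `pregnant`, `salary_credited`) are hypothetical and serve only to illustrate the two examples above.

```python
import pandas as pd

# Hypothetical employee records, made up for illustration.
employees = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "pregnant": ["No", "Yes", "Yes", "No"],
    "salary_credited": [52000, 61000, 48000, -1],
})

# Rule 1: a credited salary cannot be negative.
negative_salary = employees["salary_credited"] < 0

# Rule 2: Gender = Male together with Pregnant = Yes is impossible.
impossible_combo = (
    (employees["gender"] == "Male") & (employees["pregnant"] == "Yes")
)

# Records that violate at least one rule.
invalid_rows = employees[negative_salary | impossible_combo]
print(invalid_rows)
```

Whether such records should be corrected, imputed, or dropped depends on the dataset; the sketch only identifies them.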

About the book

Book DOI https://doi.org/10.1017/9781009170239
Subjects: Communications and Signal Processing, Computer Science, Engineering, Machine Learning and Pattern Recognition
Format: Paperback
- Publication date: 26 March 2026
- ISBN: 9781009170246
Format: Digital
- Publication date: 22 February 2025
- ISBN: 9781009170239