
Chapter 3: Data Pre-processing

pp. 125-154

Authors

, Washington State University, USA

Extract

Chapter Objectives

  • To understand the need for data pre-processing.

  • To learn about the different phases of data pre-processing: data cleaning, data integration, data transformation, and data reduction.

  • To understand the need for feature scaling.

  • To comprehend normalization and standardization techniques for feature scaling.

  • To understand principal component analysis for feature extraction.

  • To pre-process categorical data for building machine learning models.

3.1 Need for Data Pre-processing

We live in an age where data is often called the new oil, because data is what we use to train machine learning (ML) algorithms. A central part of a data analyst's job is to collect, clean, and analyze data and then build ML models on the cleaned dataset. Often, however, the raw data that we obtain is noisy: it contains discrepancies, inconsistencies, and missing values. To understand this situation, let us consider an example.

Suppose we have to predict the house price, and for this, we have collected data from a few previous transactions, as shown in Figure 3.1.

In an ideal situation, the captured data would be in the format shown in Figure 3.1. Here, the size of the house and the number of bedrooms are the input features, while the price is the output attribute. We can predict the price of an unseen instance through regression.
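To make the regression step concrete, here is a minimal sketch in plain Python. The transactions are hypothetical values made up for illustration (they are not the values in Figure 3.1), the size is assumed to be in square feet, and the helper names `fit_linear` and `predict` are our own. The sketch fits an ordinary least-squares line by solving the normal equations; it is one possible way to do the regression, not necessarily the method used later in the chapter.

```python
# Hypothetical clean transactions (made up for illustration, NOT Figure 3.1):
# (size in square feet, number of bedrooms) -> price
features = [(1000, 2), (1500, 3), (2000, 3), (2500, 4), (1200, 3)]
prices = [200_000, 290_000, 360_000, 450_000, 248_000]


def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    X = [list(r) + [1.0] for r in rows]  # append a constant 1 for the intercept
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (A[i][n] - s) / A[i][i]
    return w  # [weight_size, weight_bedrooms, intercept]


def predict(w, row):
    """Predicted price for one (size, bedrooms) pair."""
    return sum(wi * xi for wi, xi in zip(w, row)) + w[-1]


weights = fit_linear(features, prices)
predicted = predict(weights, (1800, 3))  # an unseen house
print(round(predicted))
```

The made-up prices follow an exact linear rule (140 per square foot, 20,000 per bedroom, plus 20,000), so the fit recovers that rule up to floating-point error. Real transaction data would not be this well behaved.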

In practice, however, the captured data is usually not of good quality, and we end up with a dataset like the one shown in Figure 3.2.

You can see that this data is messy. There are many unknown or missing values, and a model trained on this data would predict poorly. You can also identify noise and incorrect labels; for example, the price in the second record is incorrect and will result in poor model training.
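A first step with messy data of this kind is simply to measure how much is missing. The sketch below assumes each record is a dictionary and that a missing value is stored as `None`; the records are hypothetical (not the contents of Figure 3.2), and `missing_counts` and `complete_rows` are illustrative helper names.

```python
# Hypothetical messy records (made up for illustration, NOT Figure 3.2).
# A missing value is assumed to be represented as None.
records = [
    {"size": 1000, "bedrooms": 2, "price": 200_000},
    {"size": None, "bedrooms": 3, "price": 290_000},
    {"size": 2000, "bedrooms": None, "price": None},
    {"size": 2500, "bedrooms": 4, "price": 450_000},
]


def missing_counts(rows):
    """Count the missing (None) entries in each column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def complete_rows(rows):
    """Keep only the rows that have no missing entries."""
    return [row for row in rows if all(v is not None for v in row.values())]


print(missing_counts(records))
print(len(complete_rows(records)), "of", len(records), "rows are complete")
```

Dropping incomplete rows is only the simplest option and discards information; whether to drop or fill in missing values depends on the dataset.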

We can consider a few more examples. If someone entered –1 in the “salary credited” column of an employee dataset, the value would not make sense and would be considered noise. Sometimes we may also have an unrealistic or impossible combination of values; for example, a record with Gender–Male and Pregnant–Yes.
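Checks like these can be written as simple validity rules that flag suspicious records before training. The following is a minimal sketch; the field names (`salary_credited`, `gender`, `pregnant`), the sample records, and the function `find_violations` are assumptions chosen to mirror the two examples above.

```python
def find_violations(record):
    """Return a list of rule violations found in one record."""
    problems = []
    # Rule 1: a credited salary cannot be negative.
    salary = record.get("salary_credited")
    if salary is not None and salary < 0:
        problems.append("negative salary")
    # Rule 2: an impossible combination of attribute values.
    if record.get("gender") == "Male" and record.get("pregnant") == "Yes":
        problems.append("impossible combination: Male and Pregnant")
    return problems


# Hypothetical records, made up for illustration.
employees = [
    {"salary_credited": 52_000, "gender": "Female", "pregnant": "No"},
    {"salary_credited": -1, "gender": "Male", "pregnant": "No"},
    {"salary_credited": 48_000, "gender": "Male", "pregnant": "Yes"},
]

for index, employee in enumerate(employees):
    for problem in find_violations(employee):
        print(f"record {index}: {problem}")
```

Each rule encodes domain knowledge about what values are possible, so the rule set has to be written for the dataset at hand.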
