Machine Learning with Python: Principles and Practical Techniques

Parteek Bhatia

doi:10.1017/9781009170239

Chapter Objectives

• To understand the need for importing libraries such as NumPy, Pandas, Matplotlib, and Scikit-Learn.
• To learn the steps to import a dataset.
• To understand the process for handling missing values.
• To discuss the steps for handling categorical data.
• To understand why and how the dataset is split into training and testing datasets.
• To discuss the steps to perform feature scaling by using normalization and standardization.

Machine learning (ML) algorithms work best on clean data. The data we collect for building ML models usually suffers from noise, missing values, inconsistent data types, and features measured on different scales. This makes pre-processing a very important phase in preparing data for ML models. Data pre-processing is the set of transformations we apply to the data before feeding it to an ML algorithm, so that the data is fit for that algorithm. It generally involves the following steps:

Step 1. Importing libraries: This step imports the libraries required for the subsequent data manipulation and cleaning tasks.
Step 2. Loading the dataset: The dataset that needs to be pre-processed must be loaded.
Step 3. Handling the missing values: Datasets often contain missing or null values, which need to be handled appropriately.
Step 4. Handling the categorical data: Categorical attributes, which take one of several category labels, must be treated and transformed properly, typically by converting them into a numerical form that most ML algorithms can work with.
Step 5. Splitting the dataset into training and testing datasets: A model must be evaluated on data it has not seen during training; thus, we need to split the dataset into training and testing subsets before building the ML models.
Step 6. Feature scaling: Features in a dataset often span very different ranges or are measured on different scales. Feature scaling brings them to a comparable scale so that no feature dominates simply because of its magnitude.

Note that not every dataset requires all of these steps; depending on the nature of the dataset, some of them may be skipped when building the model. In the coming sections, we discuss why each step is needed and how to perform it in Python.
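As a preview, the six steps can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's own code: the tiny in-memory CSV, its column names (Country, Age, Salary, Purchased), and the choices of mean imputation, one-hot encoding, and an 80/20 split are all made up for the example.

```python
# Step 1: import the libraries
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Step 2: load the dataset. The CSV below is hypothetical and held in memory;
# a real project would pass a file path to pd.read_csv instead.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Step 3: handle missing values by replacing them with the column mean.
# (This follows the order of the steps listed above; a stricter workflow
# would fit the imputer on the training split only.)
numeric_cols = ["Age", "Salary"]
imputer = SimpleImputer(strategy="mean")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Step 4: handle categorical data. One-hot encode the Country feature and
# map the Yes/No target to 1/0.
X = pd.get_dummies(df[["Country", "Age", "Salary"]], columns=["Country"], dtype=float)
y = df["Purchased"].map({"No": 0, "Yes": 1})

# Step 5: split into training and testing subsets (80% / 20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 6: feature scaling. Both scalers are fitted on the training data only.
# Normalization (min-max) rescales each feature to the range [0, 1].
normalizer = MinMaxScaler()
X_train_norm = normalizer.fit_transform(X_train[numeric_cols])

# Standardization rescales each feature to zero mean and unit variance.
standardizer = StandardScaler()
X_train_std = standardizer.fit_transform(X_train[numeric_cols])
X_test_std = standardizer.transform(X_test[numeric_cols])
```

Fitting the scalers on the training subset and only applying them to the test subset keeps information about the test data out of the pre-processing parameters.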

Chapter 4: Implementing Data Pre-processing in Python
