Methods to Preprocess Your Data
Stephen Chen | October 10, 2021
Before building a model, I explore and preprocess my data to ensure that the information I feed into the model is valid. This step is just as important as building the model itself, yet it is often overlooked. Below, I discuss a couple of tools that I use to preprocess data.
One-hot Encoding
Sometimes my features will contain categorical data, such as a person's marital status, which is non-numeric. Certain estimators cannot accept categorical data and need the information represented as numbers. This is where one-hot encoding helps. One-hot encoding takes a categorical feature and creates a dummy variable for each possible value of that feature. A dummy variable is simply 1 when the value matches and 0 when it does not.
For example, if the marital statuses in my categorical feature were single, married, and divorced, one-hot encoding would replace that feature with three columns. Each row would show a 1 in the single column if the person is single, in the married column if married, or in the divorced column if divorced, and a 0 otherwise. By using one-hot encoding, I have turned a non-numeric categorical feature into three numeric columns. One-hot encoding is one of the most popular methods for processing categorical data. To implement it, the pandas library in Python provides a function called get_dummies. Additionally, scikit-learn has a helpful class called OneHotEncoder.
Below is an example implementation of one-hot encoding using pandas.
# import the pandas library
import pandas as pd
# one-hot encode the categorical columns using the pandas.get_dummies() function
model_dataframe_with_get_dummies = pd.get_dummies(model_dataframe)
Normalization
Apart from one-hot encoding, I sometimes need to normalize the features in my data because they span very different ranges of values. For example, I may have one feature for a person's age (roughly 0 to 100) and another for a person's income (0 to over a million). Normalizing the data in this case is important because certain estimators, such as support vector machines, perform calculations based on the distance between points, and others, such as regularized logistic regression, are sensitive to feature scale. Without normalization, my estimator's calculations will be skewed toward the features with larger values.
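To make that skew concrete, here is a small sketch with made-up numbers (assuming age spans 0 to 100 and income spans 0 to 1,000,000 across the dataset) comparing the Euclidean distance between two people before and after scaling:

```python
import numpy as np

# two hypothetical people as (age, income) vectors -- illustrative values only
a = np.array([25.0, 50_000.0])
b = np.array([55.0, 51_000.0])

# without scaling, the income difference dominates the distance
raw_dist = np.linalg.norm(a - b)  # ~1000.45: the 30-year age gap barely registers

# after min-max scaling each feature to [0, 1], the age gap drives the distance
a_scaled = np.array([25 / 100, 50_000 / 1_000_000])
b_scaled = np.array([55 / 100, 51_000 / 1_000_000])
scaled_dist = np.linalg.norm(a_scaled - b_scaled)  # ~0.30
```

In the raw distance, a modest income gap of 1,000 overwhelms a 30-year age gap; after scaling, both features contribute on comparable terms.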
One method to normalize the data is to use scikit-learn's MinMaxScaler class. By default, it scales each feature to a value between 0 and 1; these boundaries can be adjusted with the feature_range parameter.
Below is an example implementation using the MinMaxScaler class.
# import the MinMaxScaler preprocessing class
from sklearn.preprocessing import MinMaxScaler
# create the scaler object (defaults to scaling into the range [0, 1])
scaler = MinMaxScaler()
# specify the features to transform using column names
features_to_transform = ["age", "income"]
# transform those features in the dataframe using the fit_transform method
model_dataframe[features_to_transform] = scaler.fit_transform(model_dataframe[features_to_transform])
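Under the hood, MinMaxScaler applies (x - min) / (max - min) to each column independently. A quick sketch verifying the formula by hand, using hypothetical ages:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical ages, shaped as a single-column matrix
ages = np.array([[20.0], [35.0], [50.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

# the same result computed manually: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())
# both give [[0.0], [0.5], [1.0]]
```

The minimum (20) maps to 0, the maximum (50) maps to 1, and everything in between is placed proportionally.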
Overall, data preprocessing is an important step in creating accurate and reliable models. There are many ways to preprocess data, and the right choice ultimately depends on the data and the model's application. scikit-learn provides a variety of helpful classes for data preprocessing.