Model Tuning Using Pipeline and GridSearchCV

Stephen Chen | December 23, 2021


After selecting my initial model and obtaining baseline results, I often look to improve my model’s runtime and target metrics by tuning its features and parameters. This can become a tedious search for the number of features and the parameter set that yield the best results on the training data. For example, the Random Forest estimator lets me set parameters such as the number of trees, maximum tree depth, and minimum samples per leaf node. Luckily, scikit-learn’s Pipeline and GridSearchCV classes allow me to accomplish this task and iterate quickly with clean, readable Python code.

First, the Pipeline class combines model transformers and an estimator into a step process. For instance, I can use SelectKBest for finding the K best features or PCA for reducing dimensionality and then apply the Random Forest estimator. The benefit of Pipeline is that my model is concise in a single compact workflow. This will be even more beneficial as we implement GridSearchCV afterwards.

Below is an example of how to implement scikit-learn’s Pipeline class.

#import Pipeline, SelectKBest transformer, and RandomForestClassifier estimator classes
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

#initialize SelectKBest and RandomForestClassifier
selector = SelectKBest(k=100)
clf = RandomForestClassifier()

#place the SelectKBest transformer and RandomForest estimator into the Pipeline
pipe = Pipeline(steps=[('selector', selector), ('clf', clf)])

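As mentioned above, PCA can stand in for SelectKBest as the transformer step. Here is a minimal sketch of that variant; the step names and the choice of 10 components are illustrative assumptions, not values from my project.

```python
#swap SelectKBest for PCA to reduce dimensionality before the Random Forest step
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

#initialize PCA with a target dimensionality and the Random Forest estimator
pca = PCA(n_components=10)
clf = RandomForestClassifier()

#the step names ('pca', 'clf') are arbitrary labels used to reference each step later
pipe_pca = Pipeline(steps=[('pca', pca), ('clf', clf)])
```

The step names matter because GridSearchCV refers to each step’s parameters by that label, as shown in the next section.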

With my pipeline set up, I can use GridSearchCV. GridSearchCV takes my pipeline and a range of values I specify for each parameter, then finds the combination that produces the best estimator for my target metric. Essentially, GridSearchCV “searches” for the best estimator over my specified parameter grid using cross-validation. Next, I will pass my Pipeline to GridSearchCV along with the parameters I want to tune.

Below is an example of how to implement scikit-learn’s GridSearchCV class.

from sklearn.model_selection import GridSearchCV
#Create the parameter grid; with a Pipeline, each key is prefixed by its step name ('clf__')
parameters = {'clf__n_estimators': [20, 50, 100, 200], 'clf__min_samples_split': [2, 5, 10, 20]}

#Perform a grid search over the parameter grid using GridSearchCV()
g_search = GridSearchCV(pipe, parameters)

#Fit the grid search object to the training data and find the optimal parameters using fit()
g_fit = g_search.fit(X_train, y_train)

#Get the best estimator and print out the estimator model
best_clf = g_fit.best_estimator_
print(best_clf)

#Use best estimator to make predictions on the test set
best_predictions = best_clf.predict(X_test)


Once I have created my grid search, I can output the details of the best model through its best_estimator_ attribute and make predictions on the testing data. As you can see, Pipeline and GridSearchCV offer a systematic method for structuring and tuning your model in a concise manner, and they are a valuable addition when creating predictive models.
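To tie the pieces together, here is a self-contained sketch of the full workflow on synthetic data. The dataset, the smaller parameter grid, and the cv=3 setting are illustrative choices so the example runs quickly, not the values from my project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

#Synthetic classification data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Pipeline: keep the 10 best features, then fit a Random Forest
pipe = Pipeline(steps=[('selector', SelectKBest(k=10)),
                       ('clf', RandomForestClassifier(random_state=0))])

#Grid keys follow the 'step__parameter' convention required by Pipeline
parameters = {'selector__k': [5, 10],
              'clf__n_estimators': [20, 50]}

#Search the grid with 3-fold cross-validation and fit on the training data
g_search = GridSearchCV(pipe, parameters, cv=3)
g_search.fit(X_train, y_train)

#Report the winning parameter combination and test-set accuracy
print(g_search.best_params_)
print(g_search.best_estimator_.score(X_test, y_test))
```

Note that GridSearchCV tunes the transformer and the estimator together here, so the number of selected features is chosen jointly with the forest size rather than fixed up front.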