Hyperparameter tuning in Python

Tips and tricks for tuning hyperparameters in machine learning to help improve model accuracy

Tooba Jamal
Towards Data Science


[Image: a dartboard with four darts, one right on the target. The dartboard represents a machine learning model, the darts are hyperparameter values, and the one on the target is the value that gives the model its highest accuracy. Photo by Afif Kusuma on Unsplash]

Hyperparameter tuning used to be a challenge for me when I was a newbie to machine learning. I always hated the hyperparameter tuning part of my projects and would usually abandon it after trying a couple of models and manually choosing the one with the highest accuracy. But now that the concepts are clear to me, I am presenting you with this article to make it easy for any newbie out there, while the hyperparameters of my current project get tuned.

Let’s start with the difference between parameters and hyperparameters, which is extremely important to know. Parameters are the components of the model that are learned during the training process; we never set them manually. A model starts the training process with random parameter values and adjusts them throughout. Hyperparameters, on the other hand, are the components you set before training the model, and their values can improve or worsen your model’s accuracy.

What is the need for hyperparameter tuning in machine learning?

Machine learning models are not intelligent enough to know on their own which hyperparameter values would lead to the highest possible accuracy on a given dataset. When set well, however, hyperparameters can produce highly accurate models, so we let the model try different combinations of hyperparameter values during training and make predictions with the best combination. Some of the hyperparameters of a Random Forest Classifier are n_estimators (the total number of trees in the forest), max_depth (the maximum depth of each tree in the forest), and criterion (the method used to make splits in each tree). Setting n_estimators to 1 or 2 doesn’t make sense, as a forest needs a reasonable number of trees, but how do we know what number of trees will yield the best results? For this purpose, we try different values such as [100, 200, 300]: the model is trained with each of the three values, and we can easily identify the optimal number of trees for our forest.

Hyperparameter tuning in Python

There are three methods of hyperparameter tuning in Python: grid search, random search, and informed search. Let’s talk about them in detail.

Grid Search

[Image: an eyeshadow palette with sixteen bright colors, each representing a different combination of hyperparameter values. Photo by Sharon McCutcheon on Unsplash]

A grid is a network of intersecting lines that forms a set of squares or rectangles like the image above. In grid search, each square in a grid has a combination of hyperparameters and the model has to train itself on each combination. For a clearer understanding, suppose that we want to train a Random Forest Classifier with the following set of hyperparameters.

n_estimators: [100, 150, 200]

max_depth: [20, 30, 40]

Have a look at the grid made from these hyperparameter values. Our model runs the training process on each combination of n_estimators and max_depth.

[Image: a grid of the hyperparameter combinations formed by n_estimators = [100, 150, 200] and max_depth = [20, 30, 40]. Representation of hyperparameter grid created by the author]
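
Since the grid image itself is not reproduced here, the same nine combinations can be enumerated programmatically. A quick sketch using scikit-learn’s ParameterGrid:

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'n_estimators': [100, 150, 200],
                      'max_depth': [20, 30, 40]})

for combo in grid:  # 3 x 3 = 9 combinations in total
    print(combo)    # e.g. {'max_depth': 20, 'n_estimators': 100}
```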

Implementation of Grid Search in Python

The scikit-learn library provides us with an easy way to implement grid search in just a few lines of code. Have a look at the example below.
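
The original post embeds this code as an image, so the following is a minimal sketch of what it describes. The dataset (scikit-learn’s breast cancer data, loaded here just to make the sketch runnable) and the grid values are assumptions for illustration; the post does not show the original ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical dataset and split, just to make the sketch runnable
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000)

# Hypothetical grid values; the article does not show the originals
grid_vals = {'C': [0.01, 0.1, 1, 10]}

grid_lr = GridSearchCV(estimator=model, param_grid=grid_vals,
                       scoring='accuracy', cv=6, refit=True)

grid_lr.fit(X_train, y_train)                    # try every combination in the grid
preds = grid_lr.best_estimator_.predict(X_test)  # predict with the best model found
```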

First, we import GridSearchCV from sklearn.model_selection and define the model we want to tune, a logistic regression in this example. The hyperparameter values are defined as a dictionary (grid_vals) whose keys are the hyperparameter names and whose values are lists of the candidate values we want to try.

We then create the GridSearchCV object, grid_lr. The estimator argument is the machine learning model we want to use, so it is set to model; param_grid is set to grid_vals; scoring='accuracy' means we want to use accuracy to evaluate each candidate; cv=6 means each candidate is assessed with 6-fold cross-validation; and refit=True refits the best estimator on the whole training set so that we can easily fit and make predictions.

Finally, we fit grid_lr to our training dataset and use the model with the best hyperparameter values, available as grid_lr.best_estimator_, to make predictions on the test dataset.

Pros and Cons of Grid Search

Grid search is easy to implement and is guaranteed to find the best model within the grid. However, it is computationally expensive: the number of models to train multiplies every time we add new hyperparameter values. In the example above, with three values each for two hyperparameters and 6-fold cross-validation, grid search already fits 3 × 3 × 6 = 54 models.

Random Search

Like grid search, in random search we still set the hyperparameter values we want to tune. However, the model does not train on every combination of hyperparameters; instead, it samples combinations at random, and we define the number of samples we want to draw from our grid.

Implementation of Random Search in Python
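
As with grid search, the original code appears as an image in the post, so the sketch below is an approximation. The search space values are hypothetical (made a little larger than the two-hyperparameter example above so there are more than n_iter combinations to sample from), and X_train, y_train, and X_test are assumed to come from a split like the one in the grid search sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

model = RandomForestClassifier()

# Hypothetical search space
param_vals = {'n_estimators': [100, 150, 200, 250, 300],
              'max_depth': [10, 20, 30, 40],
              'criterion': ['gini', 'entropy']}

random_rf = RandomizedSearchCV(estimator=model,
                               param_distributions=param_vals,
                               n_iter=10,           # sample 10 of the 40 combinations
                               scoring='accuracy',
                               cv=5,
                               refit=True,
                               n_jobs=-1,           # use all available cores
                               random_state=42)

random_rf.fit(X_train, y_train)
preds = random_rf.best_estimator_.predict(X_test)
```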

First, we import RandomizedSearchCV and define our model, a Random Forest Classifier in this example, along with the hyperparameter values we want to check.

We then create the RandomizedSearchCV object, random_rf. The estimator argument is set to model, the RandomForestClassifier we defined; param_distributions (the counterpart of param_grid in grid search) is set to param_vals; n_iter=10 is the number of samples we want to draw from all the hyperparameter combinations; scoring='accuracy' means we want to use accuracy to evaluate each candidate; cv=5 means 5-fold cross-validation; refit=True refits the best estimator so that we can easily fit and make predictions; and n_jobs=-1 means we want to use all the resources available to run the randomized search.

Finally, we fit random_rf to our training dataset and use the best model, random_rf.best_estimator_, to make predictions on the test dataset.

Note that the total number of model fits is n_iter × cv, which is 10 × 5 = 50 in our example: each of the ten sampled combinations is evaluated with 5-fold cross-validation.

Pros and Cons of Random Search

Random search is computationally cheaper than grid search. However, it is not guaranteed to find the best combination in the search space.

Informed Search

Informed search is my favorite method of hyperparameter tuning because it combines the advantages of grid and random search, though it has disadvantages of its own. Unlike grid and random search, informed search learns from its previous iterations through the following process (a rough sketch in code follows the list):

  1. Run a random search
  2. Find the areas with good scores
  3. Run a grid search in those smaller areas
  4. Continue until the optimal solution is obtained
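
There is no single scikit-learn class for this loop, but a minimal coarse-to-fine sketch might look like the following. The search spaces are hypothetical, and X_train and y_train are assumed from the earlier sketches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

model = RandomForestClassifier()

# Step 1: coarse random search over a wide, hypothetical space
wide_space = {'n_estimators': list(range(50, 501, 50)),
              'max_depth': list(range(5, 51, 5))}
coarse = RandomizedSearchCV(model, wide_space, n_iter=10, cv=5, random_state=42)
coarse.fit(X_train, y_train)

# Steps 2-3: grid search in a smaller area around the best coarse result
best = coarse.best_params_
narrow_grid = {'n_estimators': [best['n_estimators'] - 25,
                                best['n_estimators'],
                                best['n_estimators'] + 25],
               'max_depth': [max(best['max_depth'] - 2, 1),
                             best['max_depth'],
                             best['max_depth'] + 2]}
fine = GridSearchCV(model, narrow_grid, cv=5)
fine.fit(X_train, y_train)

# Step 4: in practice, repeat the zoom-in until the score stops improving
print(fine.best_params_, fine.best_score_)
```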

A genetic algorithm is a method of informed hyperparameter tuning inspired by genetics in the natural world. We start by creating some models, pick the best among them, create new models similar to the best ones, and add some randomness, repeating until we reach our goal.

Implementation of Genetic Algorithm in Python

The library we use here is TPOT, whose key arguments include generations (the number of iterations to run the optimization for), population_size (the number of models to keep after each iteration), and offspring_size (the number of models to produce in each iteration). Have a look at the example below.
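
The post shows this code as an image as well; a minimal sketch using TPOT’s documented arguments (with X_train, y_train, and X_test assumed from the earlier split) could be:

```python
from tpot import TPOTClassifier

tpot_clf = TPOTClassifier(generations=100,      # iterations to run the search for
                          population_size=100,  # models kept after each iteration
                          offspring_size=100,   # models produced in each iteration
                          verbosity=2,          # print progress after each generation
                          cv=6)                 # 6-fold cross-validation per candidate

tpot_clf.fit(X_train, y_train)  # TPOT picks both the model and its hyperparameters
preds = tpot_clf.predict(X_test)
```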

We import TPOTClassifier and define the classifier as tpot_clf, with generations, population_size, and offspring_size each set to 100. verbosity=2 lets us see the output of each generation (iteration), and cv=6 means we want 6-fold cross-validation for each candidate.

We then fit tpot_clf to our training set and make predictions on the test set.

Note that we have not defined any model here as TPOTClassifier takes care of choosing the model for our dataset.

Pros and Cons of Genetic Algorithm

As discussed above, the genetic algorithm combines the advantages of grid and random search: it learns from its previous iterations, and the TPOT library takes care of estimating the best hyperparameter values and selecting the best model. However, it is computationally expensive and time-consuming.

Conclusion

In this article, we have gone through three hyperparameter tuning techniques in Python. Grid search, random search, and informed search each come with their own advantages and disadvantages, so we need to weigh our requirements to pick the best technique for our problem.

I hope this article will help you improve your machine learning models’ accuracy in less time.

Please provide your feedback and share the article if you like it. Thank you for reading!

