Combating the curse of dimensionality

Methods to reduce the dimensions of your dataset to curb overfitting and speed up computation

Tooba Jamal
Towards Data Science

--

Photo by moren hsu on Unsplash

When you work with high-dimensional data, you are more likely to run into problems such as overfitting, longer computation times, and ambiguous insights. In this article, we will explore some methods for reducing the dimensionality of high-dimensional data. But before that, let’s cover a few concepts that are important to understand before getting deeper into dimensionality reduction.

There are two broad approaches to dimensionality reduction: feature selection and feature extraction. Feature selection means keeping the most important features of a dataset in their original form for modeling, whereas feature extraction means creating new features from the existing ones while preserving as much of the important information as possible.

The number of features in a dataset should be much smaller than the number of records, and as the number of features grows, the number of observations needed to cover the feature space grows exponentially. A dataset with many features and few records tends to produce badly overfitted models, which we always want to avoid. Hence, we drop unimportant features that do not contribute much to the decision and keep only the important ones. But how do we know which features are important? Luckily, Python libraries make dimensionality reduction much easier for us. So, let’s look at different ways of reducing dimensions in data and how to implement them in Python.

Feature selection with a variance threshold

As the name suggests, variance represents the variability in a dataset, indicating how spread out the distribution is. If the variance of a feature is very low, the feature carries little information. Suppose we have a dog health dataset with features such as a dog’s weight, height, and BMI. If the weight feature takes the same value in almost every record, its variance is close to zero and the feature is unimportant for us. Since weight is essentially the same across all records, it does not hold much information for uncovering unknown insights or for training predictive models.

Sample dog dataset with weight, height, and species features (bulldog, Labrador, and poodle), created by the author to illustrate the effect of variance.
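To see this in code, here is a minimal sketch; the dog_df DataFrame below is a stand-in for the dataset in the figure, with made-up values, and only illustrates how a constant column shows up as zero variance.

```python
import pandas as pd

# Stand-in for the dog dataset in the figure; the values are made up
dog_df = pd.DataFrame({
    "weight": [30, 30, 30],                        # identical in every record
    "height": [40, 55, 32],
    "species": ["bulldog", "labrador", "poodle"],
})

# Variance of the numeric columns: weight comes out as 0.0
print(dog_df[["weight", "height"]].var())
```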

Python’s scikit-learn library provides an easy way to drop low-variance features with the VarianceThreshold estimator. We set its threshold parameter to whatever variance cutoff we want; features with variance below the threshold are dropped automatically.

Python implementation of Variance Threshold
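The snippet below is a minimal sketch of this approach; the dog_df DataFrame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical dog dataset; the values are made up for illustration
dog_df = pd.DataFrame({
    "weight": [30, 30, 30, 30],        # no spread, so variance is 0
    "height": [40, 55, 32, 61],
    "bmi": [18.1, 21.4, 17.9, 25.0],
})

# Drop features whose variance falls below the threshold of 1
selector = VarianceThreshold(threshold=1)
selector.fit(dog_df)

mask = selector.get_support()        # boolean mask of the retained columns
reduced_df = dog_df.loc[:, mask]     # weight is dropped
print(reduced_df.columns.tolist())   # ['height', 'bmi']
```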

We first import VarianceThreshold and instantiate it with a threshold of 1. We then fit the estimator to our dataset, build a boolean mask of the features that meet the variance threshold, and apply that mask to the original dataset to obtain the reduced dataset.

Feature selection with pairwise correlation

Correlation describes the relationship between two features. A positive correlation means that when one feature increases the other tends to increase as well, while a negative correlation means that when one increases the other tends to decrease. To make this concrete, take the same dog dataset, but this time look at two features: weight and height. If the correlation between these two variables is 0.99, they are strongly positively correlated, meaning that as a dog’s weight increases its height increases as well, and vice versa. Using both features for modeling therefore only adds redundancy, extra computation time, and a risk of overfitting, since either one is enough to capture the relationship with the target.

Python implementation of pairwise correlation
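Again, this is a minimal sketch rather than the exact original snippet; dog_df and its values are invented, and the 0.9 cutoff is the one used in the walkthrough below.

```python
import numpy as np
import pandas as pd

# Hypothetical dog dataset; weight and height are almost perfectly correlated
dog_df = pd.DataFrame({
    "weight": [22, 30, 8, 35],
    "height": [45, 60, 25, 66],
    "bmi": [18.1, 21.4, 17.9, 25.0],
})

# Absolute pairwise correlations between all features
corr_matrix = dog_df.corr().abs()

# Hide the upper triangle (and the diagonal) so each pair appears only once
upper = np.triu(np.ones_like(corr_matrix, dtype=bool))
reduced_matrix = corr_matrix.mask(upper)

# Columns correlating above 0.9 with another feature are candidates to drop
to_drop = [col for col in reduced_matrix.columns
           if (reduced_matrix[col] > 0.9).any()]
print(to_drop)   # ['weight']
```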

We first compute the correlation matrix of all features in the dataset. We then build a boolean mask that hides the mirrored half of the matrix, so each correlation value appears only once, and apply the mask to the correlation matrix. Finally, we loop over the columns of the reduced matrix, collect every column that has a correlation greater than 0.9 with another feature, and print the names of the columns we need to drop.

Recursive feature elimination

Recursive feature elimination, or RFE, is a feature selection technique that relies on a supervised machine learning algorithm. The algorithm provides coefficients or feature importances for each feature, and RFE eliminates the least significant feature at each iteration until it reaches the desired number of features.

Python implementation of RFE
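A minimal sketch of RFE with a decision tree follows; scikit-learn’s breast cancer dataset stands in here for whatever labelled data you are working with, and the choice of five features simply mirrors the walkthrough below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Example labelled dataset with 30 features; any dataset of your own works too
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# The decision tree supplies the feature importances RFE needs for ranking
dt = DecisionTreeClassifier(random_state=0)
rfe = RFE(estimator=dt, n_features_to_select=5)
rfe.fit(X, y)

# Names of the five features RFE decided to keep
selected_features = X.columns[rfe.support_]
print(selected_features.tolist())
```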

We import the decision tree classifier and RFE, define the decision tree as dt, and create the selector rfe with estimator set to dt and n_features_to_select set to five, which means we want the five most important features from our dataset. After fitting the selector to the data, we store the names of the retained features in selected_features and print them to have a look at our most important features.

Feature extraction with principal component analysis or PCA

PCA uses linear algebra to transform the original variables into new ones with the help of the covariance matrix, its eigenvalues, and its eigenvectors. It starts by computing the eigenvalues and eigenvectors of the covariance matrix; the eigenvectors associated with the largest eigenvalues capture the most variance and form the principal components. Stacking the chosen eigenvectors gives a feature matrix, and the final, reduced dataset is obtained by projecting the original (centered) data onto these components.

Python implementation of PCA
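Here is a minimal sketch of PCA in scikit-learn; the breast cancer dataset and the scaling step are assumptions added to make the example runnable, since PCA is sensitive to feature scales.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example dataset; standardising first keeps features on a comparable scale
X, _ = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

# Extract five principal components from the original 30 features
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (569, 5)
print(pca.explained_variance_ratio_)    # variance captured by each component
```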

We import PCA and instantiate it with n_components set to five, indicating that we want five extracted features. We then fit it to the dataset and transform the data to obtain a new dataset containing our newly extracted features.

Conclusion

In this article, we discussed the curse of dimensionality and the difference between feature selection and feature extraction. We then covered four dimensionality reduction techniques and how to implement them in Python.

I really hope this has helped you learn new skills or clear up doubts about dimensionality reduction. If you have any questions, reach out to me in the comments. Hit the clap button and share the article if you like it. Thank you for reading!
