Intrusion detection in KDD99 dataset using machine learning

How machine learning can be used for intrusion detection in networks

Tooba Jamal
3 min read · Nov 27, 2021

With the growth of internet usage, intrusion detection has become a popular and important application of machine learning. An intrusion detection system is software that monitors network traffic and raises an alert whenever abnormal behavior or an abnormal connection is detected. The aim is to stop intruders from gaining access to the network and damaging it. An intruder can be any unwanted connection to your network made with malicious intent.

Our approach to intrusion detection

For this project, we use the KDD Cup 99 dataset, or more precisely the 10% subset of the full KDD99 data. The dataset is widely used in academic research on anomaly and intrusion detection. The KDD Cup 99 (KDD99) dataset and its description can be found on the UCI KDD Archive.

The dataset contains 42 columns (41 input features plus the target) and 494,020 records. The target feature specifies the connection type, which can be classified as either normal or abnormal. Abnormal connections can further be categorized into four types of cyber attacks: Probe, Denial of Service (DoS), R2L (Remote to Local), and U2R (User to Root).

Get the project started

Let's begin the project with a little exploration of the dataset, after importing the necessary libraries and, of course, loading the data.
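The code below is a minimal sketch of this step, assuming the 10% subset has been saved locally as kddcup.data_10_percent (a hypothetical local filename). The raw file ships without a header row, so the standard column names from kddcup.names are attached by hand.

```python
import pandas as pd

# Standard KDD99 column names (41 features + the target label), taken from kddcup.names.
columns = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

# Load the 10% subset (no header row in the raw file).
df = pd.read_csv("kddcup.data_10_percent", names=columns)

# A quick look at the data: shape, first rows, and the distribution of connection types.
print(df.shape)
print(df["label"].value_counts())
df.head()
```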

Note that more than half of the records are redundant duplicates. After dropping them, we are left with 145,585 records.
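A quick sketch of the de-duplication step, continuing with the df loaded above:

```python
# More than half of the rows in the 10% subset are exact duplicates of other rows.
print("Records before:", len(df))
df = df.drop_duplicates()
print("Records after :", len(df))  # roughly 145,585 records remain
```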

Luckily, we have no missing values. Now let's define the input and output features and move on.
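Something like the following, assuming the target column is named label as in the loading sketch above:

```python
# Check for missing values across all columns (expected to be zero).
print(df.isnull().sum().sum())

# X holds the 41 input features, y holds the connection label.
X = df.drop("label", axis=1)
y = df["label"]
```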

Now comes the crucial step of preprocessing. We use MinMaxScaler to scale the numeric features and LabelEncoder to encode the categorical features into numeric values.
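A sketch of the preprocessing, assuming the three symbolic columns are protocol_type, service, and flag; the target labels are also label-encoded so the classifiers receive integers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Encode the symbolic/categorical features as integers.
for col in ["protocol_type", "service", "flag"]:
    X[col] = LabelEncoder().fit_transform(X[col])

# Encode the target labels (normal, smurf, neptune, ...) as integers too.
y = LabelEncoder().fit_transform(y)

# Scale every feature into the [0, 1] range.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
```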

Next, split the dataset into a 70% training set and a 30% test set, so we can train the models and evaluate their predictions.
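For example (random_state is an arbitrary value added here for reproducibility, not taken from the original notebook):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the records for evaluation; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```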

We have quite a lot of records with 41 input features, which can increase computation time and also lead to overfitting. Therefore, we use a Random Forest to select the most important features from the dataset.
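One common way to do this, and the approach described in the feature-selection article referenced at the end, is SelectFromModel wrapped around a Random Forest; the exact settings below are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a Random Forest and keep only the features whose importance is above
# SelectFromModel's default threshold (the mean importance).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X_train, y_train)

selected_features = X_train.columns[selector.get_support()]
print(list(selected_features))
```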

Now let's do the actual training and prediction.
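A sketch of training the three models on the selected features and measuring accuracy on the held-out test set (the hyperparameters are scikit-learn defaults, not necessarily those used in the original notebook):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "K Nearest Neighbors": KNeighborsClassifier(),
}

# Train each model on the selected features and report test-set accuracy.
for name, model in models.items():
    model.fit(X_train[selected_features], y_train)
    preds = model.predict(X_test[selected_features])
    print(f"{name}: {accuracy_score(y_test, preds):.4f}")
```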

Let’s try plotting the feature importances of our input features and see if we can extract some valuable insights from them.
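One way to plot the importances, reusing the Random Forest fitted inside the selector above (the matplotlib usage here is an illustrative assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd

# The fitted Random Forest is available on the selector as estimator_.
importances = pd.Series(
    selector.estimator_.feature_importances_, index=X_train.columns)

# Horizontal bar chart, most important features at the top.
importances.sort_values().plot(kind="barh", figsize=(8, 10))
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```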

Notice that the features selected by the Random Forest classifier for feature selection were serror_rate, same_srv_rate, diff_srv_rate, and dst_host_diff_srv_rate. However, according to the bar graph we plotted, the top seven most important features are diff_srv_rate, same_srv_rate, dst_host_same_srv_rate, flag, count, dst_host_srv_count, and dst_host_srv_serror_rate.

Let’s train all three models with these seven features and see if it makes any difference.
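Retraining is the same loop as before, restricted to the seven features identified from the bar graph:

```python
top_features = [
    "diff_srv_rate", "same_srv_rate", "dst_host_same_srv_rate", "flag",
    "count", "dst_host_srv_count", "dst_host_srv_serror_rate",
]

# Retrain each model on the seven most important features and compare accuracy.
for name, model in models.items():
    model.fit(X_train[top_features], y_train)
    preds = model.predict(X_test[top_features])
    print(f"{name}: {accuracy_score(y_test, preds):.4f}")
```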

See how accuracy has increased from 96.6% to 97.6% by using more relevant features than before. Random Forest and Decision Tree performed best with 97.65% accuracy, while K Nearest Neighbors has a slightly lower accuracy of 97.64%.

Link to Github repository: https://github.com/ToobaJamal/Intrusion-Detection-in-KDD99-dataset

References

https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f
