Airbnb Booking Destination Prediction with Machine Learning
Introduction
The goal of this project is to build a machine learning model that predicts whether a visitor to the Airbnb website will book a destination and, if so, where they would want to go.
The dataset can be found on Kaggle. The training dataset has 213,451 rows and 16 columns, and the test dataset has 62,096 rows.
The project is divided into the following steps:
- Download the data
- Exploratory Data Analysis
- Feature Engineering
- Data transformation
- Model building and prediction
Downloading the dataset
Let's start by downloading the dataset from Kaggle using the opendatasets library from Jovian.
Exploratory Data Analysis
We have saved our training dataset in train_df and our testing dataset in test_df. Now let's explore the data. We'll start by importing the libraries we will need along the way and setting the style for our visualizations.
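A typical setup for this kind of analysis looks like the sketch below (the specific style and figure size are assumptions, not taken from the original notebook):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set a consistent look for all plots (style choices are illustrative)
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
```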
The data might have some null values. Let's check them by calling .isna().sum() on our dataset.
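The check itself is a one-liner; the toy DataFrame below just mimics a few of the columns with missing values so the pattern is visible:

```python
import pandas as pd

# Toy frame standing in for train_df, with deliberate gaps
df = pd.DataFrame({
    'date_first_booking': ['2014-01-04', None, None],
    'signup_method': ['basic', None, 'facebook'],
    'first_affiliate_tracked': ['untracked', 'omg', None],
})

# Count nulls per column
null_counts = df.isna().sum()
print(null_counts)
```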
Notice that date_first_booking, signup_method, and first_affiliate_tracked have 124,543, 87,990, and 6,065 null values respectively.
Now let's explore the proportion of males and females in our dataset.
Notice that there are about 5% more female customers than male customers.
Now let's have a look at different age groups people belong to
The most common customer age is 31. However, there seem to be a lot of outliers in the age column of our training dataset.
Check the signup methods most used
The language most customers prefer
Just as I thought, it's English!
What affiliate channels do customers come from?
I use the desktop to sign up for most apps. Let's see what Airbnb customers prefer.
What's the device Airbnb customers use on their first visit/ activity?
Mac desktop is pretty popular!
What browser do Airbnb customers use on their first activity?
I too love chrome but I am AMAZED to see Internet Explorer here! Are you?
Let's explore the most interesting thing now. What destination do Airbnb customers travel to the most?
NDF stands for "no destination found", meaning the visitor never booked. Among those who did book, people really love the US!
Feature Engineering
Our dataset has two columns in date format. Let’s split them into year, month and day first.
Both date_account_created and timestamp_first_active are split into year, month and day.
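A sketch of the split, assuming date_account_created is a date string and timestamp_first_active is an integer in YYYYMMDDHHMMSS form (as in the Kaggle data); the derived column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'date_account_created': ['2014-01-04', '2013-06-19'],
    'timestamp_first_active': [20140104193942, 20130619043255],
})

# date_account_created parses directly as a date
dac = pd.to_datetime(df['date_account_created'])
df['dac_year'], df['dac_month'], df['dac_day'] = dac.dt.year, dac.dt.month, dac.dt.day

# timestamp_first_active is an integer like YYYYMMDDHHMMSS,
# so convert to string and parse with an explicit format
tfa = pd.to_datetime(df['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
df['tfa_year'], df['tfa_month'], df['tfa_day'] = tfa.dt.year, tfa.dt.month, tfa.dt.day
```

After this, the original two date columns can be dropped and the numeric parts fed to the model.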
Remember date first booking feature had a lot of null values. Let's check if the destination is also null for those values.
Where date_first_booking is null, country_destination is also null or NDF so it would be fine to drop null date_first_booking records.
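The drop can be done with dropna on a subset of columns; the two-row toy frame below just illustrates the idea:

```python
import pandas as pd

df = pd.DataFrame({
    'date_first_booking': ['2014-01-07', None, None],
    'country_destination': ['US', 'NDF', 'NDF'],
})

# Rows with no first booking correspond to NDF, so drop them
df = df.dropna(subset=['date_first_booking']).reset_index(drop=True)
```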
The dataset is large enough that computation time becomes a concern. Therefore, we select only the most relevant features from the training data to feed into our machine learning models.
Data Transformation
Let's remove outliers from age by dropping rows where age is greater than 110 or less than 18.
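A minimal sketch of that filter (the sample ages are made up, including the kind of impossible values seen in the EDA):

```python
import pandas as pd

df = pd.DataFrame({'age': [31, 2014, 5, 45, 110, 17]})

# Keep only plausible ages: 18 through 110 inclusive
df = df[(df['age'] >= 18) & (df['age'] <= 110)].reset_index(drop=True)
```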
Now have a look at the boxplot of age
Let's define numeric and categorical columns for feature transformation
Checking null values in numeric columns in both training and test data
Null first booking year, month, and day are likely there because the customer never made a booking. Therefore, we put 0 in place of null.
We also have null values in age. Let's replace those with 31, as it's the most common age among all the customers.
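Both fills can be done with fillna; the column names below are illustrative stand-ins for the split booking-date features:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_booking_year': [2014, np.nan],
    'first_booking_month': [1, np.nan],
    'first_booking_day': [7, np.nan],
    'age': [np.nan, 45],
})

# No booking -> 0 in the booking-date parts
booking_cols = ['first_booking_year', 'first_booking_month', 'first_booking_day']
df[booking_cols] = df[booking_cols].fillna(0)

# Missing age -> 31, the most common age
df['age'] = df['age'].fillna(31)
```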
Now, scale the numeric columns in the range of 0–1 by using MinMaxScaler
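A sketch with a single numeric column (in the real pipeline the scaler would cover all numeric columns, and should be fit on the training data only, then applied to the test data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'age': [18.0, 31.0, 110.0]})

# Rescale each numeric column linearly into [0, 1]
scaler = MinMaxScaler()
df[['age']] = scaler.fit_transform(df[['age']])
```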
Replace null values in first_affiliate_tracked with Unknown before categorical encoding
Let’s define a function to encode categorical columns in our dataset
We have used LabelEncoding to avoid an increased number of features
Encode both training and test data
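One way to write such a function is sketched below. Fitting each LabelEncoder on the combined train and test values is an assumption on my part; it avoids errors when the test set contains a category the training set lacks:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical(train, test, columns):
    """Label-encode the given columns in place on both frames."""
    for col in columns:
        le = LabelEncoder()
        # Fit on train + test so unseen test categories don't break transform
        le.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = le.transform(train[col].astype(str))
        test[col] = le.transform(test[col].astype(str))
    return train, test

train = pd.DataFrame({'signup_method': ['basic', 'facebook', 'basic']})
test = pd.DataFrame({'signup_method': ['google', 'basic']})
train, test = encode_categorical(train, test, ['signup_method'])
```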
Model building and prediction
Split the training data into training and validation data in 70–30 ratio
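The split is one call to train_test_split; the toy X and y below stand in for the selected features and the encoded target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 70% training, 30% validation; fixed seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)
```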
We will use Logistic Regression and XGBClassifier to make predictions. Let’s start off by importing XGBClassifier
Let's train our model with n_estimators set to 50, a max_depth of 50, and a learning rate of 0.5.
Fitting the model on selected features from the training dataset
Let's have a look at the accuracy score
Let's try 100 n_estimators now
Let's try decreasing the learning rate to 0.01
Try with 70 n_estimators
Note that XGBClassifier worked best with 50 n_estimators and a learning rate of 0.05 with 87.63% accuracy on training data and 87.43% accuracy on validation data.
Now let's train and predict our target feature with Logistic Regression.
Let's try changing solver to sag and see if it impacts our predictions
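A side-by-side sketch of the two solvers on synthetic data (the real model would be fit on the prepared features; note that liblinear handles multiclass one-vs-rest, while sag benefits from scaled features, which our MinMax-scaled data provides):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

lib = LogisticRegression(solver='liblinear').fit(X, y)
sag = LogisticRegression(solver='sag', max_iter=1000).fit(X, y)

acc_lib, acc_sag = lib.score(X, y), sag.score(X, y)
```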
The accuracy is actually the same with both liblinear and sag
Conclusion
- There are slightly more female customers than male customers on Airbnb
- Google Chrome is the most popular browser used by Airbnb customers
- Most customers like to visit the US
- The data has many outliers, with some age values in the hundreds and thousands
- Most Airbnb customers are Mac desktop users and 25–37 years of age
- XGBClassifier worked best with 50 n_estimators and 0.05 learning_rate with 87.63% and 87.43% accuracy on training and validation dataset respectively
- Logistic Regression gave slightly lower accuracy on training data than XGBClassifier: 87.62% on training and 87.43% on validation data
Future Work
In the future, I aim to improve this project by trying better techniques for hyperparameter tuning and training the models with more features.