Airbnb Booking Destination Prediction with Machine Learning
Introduction
The goal of this project is to build a machine learning model that predicts whether a visitor to the Airbnb website will book a destination and, if so, where they would want to go.
The dataset can be found on Kaggle. The training dataset has 213,451 rows and 16 columns, and the test dataset has 62,096 rows.
The project is divided into the following steps:
- Download the data
- Exploratory Data Analysis
- Feature Engineering
- Data transformation
- Model building and prediction
Downloading the dataset
Let's start by downloading the dataset from Kaggle using the opendatasets library from Jovian.
Exploratory Data Analysis
We have saved our training dataset in train_df and our testing dataset in test_df. Now let's explore the data. We'll start by importing the libraries we will need along the way and setting the style for our visualizations.
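A typical setup for this kind of analysis looks like the sketch below (the specific style and figure size are assumptions, not taken from the original notebook):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set a consistent look for all plots (style choices are illustrative)
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
```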
The data might have some null values. Let's check them by calling .isna().sum() on our dataset.
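The check itself is a one-liner; the toy DataFrame below just mimics a few of the columns with missing values so the pattern is visible:

```python
import pandas as pd

# Toy frame standing in for train_df, with deliberate gaps
df = pd.DataFrame({
    'date_first_booking': ['2014-01-04', None, None],
    'signup_method': ['basic', None, 'facebook'],
    'first_affiliate_tracked': ['untracked', 'omg', None],
})

# Count nulls per column
null_counts = df.isna().sum()
print(null_counts)
```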
Notice that date_first_booking, signup_method, and first_affiliate_tracked have 124,543, 87,990, and 6,065 null values respectively.
Now let's explore the proportion of males and females in our dataset.
Notice that there are about 5% more female customers than male customers.
Now let's have a look at different age groups people belong to
The most common customer age is 31. However, there seem to be a lot of outliers in the age column of our training dataset.
Check the signup methods most used
The language most customers prefer
Just as I thought, it's English!
What affiliate channels do customers come from?
I use the desktop to sign up for most apps. Let's see what Airbnb customers prefer.
What's the device Airbnb customers use on their first visit/ activity?
Mac desktop is pretty popular!
What browser do Airbnb customers use on their first activity?
I too love chrome but I am AMAZED to see Internet Explorer here! Are you?
Let's explore the most interesting thing now. What destination do Airbnb customers travel to the most?
NDF stands for "no destination found", meaning the visitor never booked. Among those who did book, people really love the US!
Feature Engineering
Our dataset has two columns in date format. Let’s split them into year, month and day first.
Both date_account_created and timestamp_first_active are split into year, month and day.
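A sketch of the split, assuming date_account_created is a date string and timestamp_first_active is an integer in YYYYMMDDHHMMSS form (as in the Kaggle data); the derived column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'date_account_created': ['2014-01-04', '2013-06-19'],
    'timestamp_first_active': [20140104193942, 20130619043255],
})

# date_account_created parses directly as a date
dac = pd.to_datetime(df['date_account_created'])
df['dac_year'], df['dac_month'], df['dac_day'] = dac.dt.year, dac.dt.month, dac.dt.day

# timestamp_first_active is an integer like YYYYMMDDHHMMSS,
# so convert to string and parse with an explicit format
tfa = pd.to_datetime(df['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
df['tfa_year'], df['tfa_month'], df['tfa_day'] = tfa.dt.year, tfa.dt.month, tfa.dt.day
```

After this, the original two date columns can be dropped and the numeric parts fed to the model.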
Remember date first booking feature had a lot of null values. Let's check if the destination is also null for those values.
Where date_first_booking is null, country_destination is also null or NDF so it would be fine to drop null date_first_booking records.
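The drop can be done with dropna on a subset of columns; the two-row toy frame below just illustrates the idea:

```python
import pandas as pd

df = pd.DataFrame({
    'date_first_booking': ['2014-01-07', None, None],
    'country_destination': ['US', 'NDF', 'NDF'],
})

# Rows with no first booking correspond to NDF, so drop them
df = df.dropna(subset=['date_first_booking']).reset_index(drop=True)
```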
The dataset is large enough that computation time becomes a concern. Therefore, we select only the most relevant features from the training data to feed into our machine learning models.
Data Transformation
Let's remove outliers from age by dropping rows where age is greater than 110 or less than 18.
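A minimal sketch of that filter (the sample ages are made up, including the kind of impossible values seen in the EDA):

```python
import pandas as pd

df = pd.DataFrame({'age': [31, 2014, 5, 45, 110, 17]})

# Keep only plausible ages: 18 through 110 inclusive
df = df[(df['age'] >= 18) & (df['age'] <= 110)].reset_index(drop=True)
```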
Now have a look at the boxplot of age
Let's define numeric and categorical columns for feature transformation
Checking null values in numeric columns in both training and test data
Null first booking year, month, and day are likely there because the customer never made a booking. Therefore, we put 0 in place of null.
We also have null values in age. Let's replace those with 31, as it's the most common age among all the customers.
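Both fills can be done with fillna; the column names below are illustrative stand-ins for the split booking-date features:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_booking_year': [2014, np.nan],
    'first_booking_month': [1, np.nan],
    'first_booking_day': [7, np.nan],
    'age': [np.nan, 45],
})

# No booking -> 0 in the booking-date parts
booking_cols = ['first_booking_year', 'first_booking_month', 'first_booking_day']
df[booking_cols] = df[booking_cols].fillna(0)

# Missing age -> 31, the most common age
df['age'] = df['age'].fillna(31)
```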
Now, scale the numeric columns in the range of 0–1 by using MinMaxScaler
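A sketch with a single numeric column (in the real pipeline the scaler would cover all numeric columns, and should be fit on the training data only, then applied to the test data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'age': [18.0, 31.0, 110.0]})

# Rescale each numeric column linearly into [0, 1]
scaler = MinMaxScaler()
df[['age']] = scaler.fit_transform(df[['age']])
```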
Replace null values in first_affiliate_tracked with Unknown before categorical encoding
Let’s define a function to encode categorical columns in our dataset
We have used LabelEncoding to avoid an increased number of features
Encode both training and test data
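One way to write such a function is sketched below. Fitting each LabelEncoder on the combined train and test values is an assumption on my part; it avoids errors when the test set contains a category the training set lacks:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical(train, test, columns):
    """Label-encode the given columns in place on both frames."""
    for col in columns:
        le = LabelEncoder()
        # Fit on train + test so unseen test categories don't break transform
        le.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = le.transform(train[col].astype(str))
        test[col] = le.transform(test[col].astype(str))
    return train, test

train = pd.DataFrame({'signup_method': ['basic', 'facebook', 'basic']})
test = pd.DataFrame({'signup_method': ['google', 'basic']})
train, test = encode_categorical(train, test, ['signup_method'])
```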
Model building and prediction
Split the training data into training and validation data in 70–30 ratio
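The split is one call to train_test_split; the toy X and y below stand in for the selected features and the encoded target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 70% training, 30% validation; fixed seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)
```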
We will use Logistic Regression and XGBClassifier to make predictions. Let’s start off by importing XGBClassifier
Let's train our model with n_estimators set to 50, a max_depth of 50, and a learning rate of 0.5.
Fitting the model on selected features from the training dataset
Let's have a look at the accuracy score
Let's try 100 n_estimators now
Let's try decreasing the learning rate to 0.01
Try with 70 n_estimators
Note that XGBClassifier worked best with 50 n_estimators and a learning rate of 0.05 with 87.63% accuracy on training data and 87.43% accuracy on validation data.
Now let's train and predict our target feature with Logistic Regression.
Let's try changing solver to sag and see if it impacts our predictions
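A side-by-side sketch of the two solvers on synthetic data (the real model would be fit on the prepared features; note that liblinear handles multiclass one-vs-rest, while sag benefits from scaled features, which our MinMax-scaled data provides):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

lib = LogisticRegression(solver='liblinear').fit(X, y)
sag = LogisticRegression(solver='sag', max_iter=1000).fit(X, y)

acc_lib, acc_sag = lib.score(X, y), sag.score(X, y)
```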
The accuracy is actually the same with both liblinear and sag
Conclusion
- There are slightly more female customers than male customers on Airbnb
- Google Chrome is the most popular browser used by Airbnb customers
- Most customers like to visit the US
- The data has many outliers, with some age values in the hundreds and thousands
- Most Airbnb customers are Mac desktop users and 25–37 years of age
- XGBClassifier worked best with 50 n_estimators and 0.05 learning_rate with 87.63% and 87.43% accuracy on training and validation dataset respectively
- Logistic Regression gave slightly lower accuracy on training data than XGBClassifier: 87.62% on training and 87.43% on validation data
Future Work
In the future, I aim to improve this project by trying better techniques for hyperparameter tuning and training the models with more features.