Price recommendations for Airbnb listings

Motivation

I wanted to check whether XGBoost can predict Airbnb listing prices from the information typically found on a listing page. Price prediction and recommendations are a big part of the offering for many travel-related websites (hotels.com, booking.com, etc.).

The Data

The data is from Inside Airbnb. For this exercise, I used only the listings from Toronto.

Imports

Reading in the Data

EDA

Data types and Non-null counts

Taking a quick look at the data, it seems we would do well to drop quite a few columns (neighbourhood_group_cleansed, bathrooms, and id, to name a few).
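As a minimal sketch of that cleanup step, here is a toy pandas snippet. The two-row frame and its values are made up for illustration; the column names follow the Inside Airbnb schema, where price arrives as a "$1,234.00"-style string that needs converting before modelling.

```python
import pandas as pd

# Toy stand-in for the Inside Airbnb frame; the real CSV has ~75 columns.
df = pd.DataFrame({
    "id": [1, 2],
    "price": ["$100.00", "$55.00"],
    "neighbourhood_group_cleansed": [None, None],  # entirely null for Toronto
    "bathrooms": [None, None],                     # entirely null in recent dumps
    "accommodates": [2, 4],
})

# Drop all-null columns and pure identifiers.
to_drop = ["neighbourhood_group_cleansed", "bathrooms", "id"]
df = df.drop(columns=to_drop)

# Convert price from a "$1,234.00" string to a float target.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
```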

Target Distribution

The distribution of price is, as expected, skewed with a few outliers. There are more than 300 listings in the CAD 0-50 range, but that should be acceptable since these are per-night prices.
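Since the histogram itself isn't reproduced here, a quick numeric check of that skew can stand in for it. The snippet below uses a synthetic log-normal series as a stand-in for the real price column; the parameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the price column: log-normal, like most price data.
price = pd.Series(rng.lognormal(mean=4.5, sigma=0.7, size=5000))

# A strongly right-skewed target (skew >> 0) is one argument for a robust
# metric such as MAE, which is used later for exactly this reason.
print(price.skew())
print(price.quantile([0.5, 0.95, 0.99]))
```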

Correlation Matrix

There are quite a few features available in this dataset. Based on the column names alone, there could be a considerable amount of multicollinearity among the predictors. Plotting a correlation matrix helps us understand why a feature is dropped when/if we do some kind of feature selection. For example, the review-related features are all correlated with each other, suggesting we might be better off keeping just one in our model. The same goes for the features that are a proxy for size: accommodates, bedrooms, beds.

As an interesting side note, while the size-related features are (unsurprisingly) positively correlated with price, the number of listings a host puts up seems to be negatively correlated. Also interesting is the lack of a strong correlation between reviews and price.
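The correlated-cluster idea above can be sketched numerically. The data below is synthetic (beds is generated to track accommodates, mimicking the "proxy for size" cluster), and the 0.8 cut-off is a hypothetical threshold, not one used in the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
accommodates = rng.integers(1, 9, size=n)
# beds tracks accommodates closely -- the "proxy for size" cluster.
beds = accommodates // 2 + rng.integers(0, 2, size=n)
review_score = rng.uniform(3.0, 5.0, size=n)  # roughly independent of size
df = pd.DataFrame({
    "accommodates": accommodates,
    "beds": beds,
    "review_scores_rating": review_score,
})

corr = df.corr()
# Flag pairs above a (hypothetical) |r| > 0.8 threshold as drop candidates.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]
print(high)
```

Of each flagged pair, we would keep only one column in the model.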

Feature Engineering

  1. Separate the numeric columns so they can be scaled
  2. property_type and neighbourhood_cleansed don't have an intuitive order to their categorical levels. They will be One-Hot encoded.
  3. room_type and bathrooms_text have an order to their levels and should be encoded ordinally.
  4. amenities is a list of available amenities. I selected a few 'important' amenities and encoded them as binary flags.
  5. id is noise and should be dropped.
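The five steps above can be sketched with scikit-learn's ColumnTransformer. Everything concrete below is an assumption for illustration: the toy rows, the two-amenity list, and the ordinal level orderings for room_type and bathrooms_text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Toy rows with the columns named above; values are made up for illustration.
df = pd.DataFrame({
    "accommodates": [2, 4, 6],
    "property_type": ["Entire condo", "Private room in home", "Entire condo"],
    "neighbourhood_cleansed": ["Annex", "Waterfront", "Annex"],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt"],
    "bathrooms_text": ["1 bath", "1.5 baths", "2 baths"],
    "amenities": [["Wifi"], ["Wifi", "Kitchen"], ["Kitchen"]],
})

# Step 4: binary-encode a hand-picked amenity list before the transformer runs.
for amenity in ["Wifi", "Kitchen"]:
    df[f"has_{amenity.lower()}"] = df["amenities"].apply(lambda a: int(amenity in a))
df = df.drop(columns=["amenities"])

preprocess = ColumnTransformer([
    # Step 1: scale the numeric columns.
    ("num", StandardScaler(), ["accommodates"]),
    # Step 2: no intuitive order -> One-Hot encode.
    ("ohe", OneHotEncoder(handle_unknown="ignore"),
     ["property_type", "neighbourhood_cleansed"]),
    # Step 3: ordered levels -> ordinal encode (the orderings are assumptions).
    ("ord", OrdinalEncoder(categories=[
        ["Shared room", "Private room", "Hotel room", "Entire home/apt"],
        ["1 bath", "1.5 baths", "2 baths"]]),
     ["room_type", "bathrooms_text"]),
], remainder="passthrough")  # the has_* flags pass through unchanged

X = preprocess.fit_transform(df)
print(X.shape)
```

Step 5 (dropping id) is just a `df.drop(columns=["id"])` before any of this runs.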

Splitting the data

20% test and 80% training. The dataset is large enough for us to stick with the traditional 80-20 split.

I used MAE as the metric because it reduces the impact of outliers on the score.
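A minimal sketch of the split and the metric, on synthetic stand-in data (the feature matrix and log-normal target are assumptions, as is the random_state). A median-predicting baseline is included because MAE is minimised by the median, which is exactly what makes it robust to price outliers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # stand-in feature matrix
y = rng.lognormal(4.5, 0.7, size=1000)      # stand-in price target

# The traditional 80-20 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Baseline: always predict the training median.
baseline = np.full_like(y_test, np.median(y_train))
mae = mean_absolute_error(y_test, baseline)
print(mae)
```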

Model Tuning

The XGBRegressor is a tree-based model, and with the default values we are overfitting a tad. We should try tweaking the learning parameters. I also added an RFECV step to the pipeline to drop some of the features and reduce the complexity of the model.

The tuning had a marginal positive effect on the test score.

Scoring On Test Data

Wrap up and improvements

Being off by CAD 28.15 on average is reasonable for mid-to-high-priced listings (CAD 100+). However, we have a considerable number of listings under CAD 100, where being roughly 30% off is not great. We could:

  1. Investigate the outliers in the data.
  2. Revisit the dropped features. In the interest of time, I didn't use all the features available, and I might have discarded a good predictor in the process.
  3. Try a neural network with PyTorch.