This project builds a recommender system for Cape Town Airbnb listings to help hosts optimize pricing, occupancy, and guest satisfaction. Using data from Inside Airbnb, the model suggests competitive, profitable prices based on listing features, guest sentiment, and occupancy patterns.
The dataset comprises information about Airbnb listings in Cape Town, focusing on features that influence pricing, occupancy, and guest satisfaction. Key data files include:
- listings.csv: Contains details about each listing, such as property type, location, amenities, and host information.
- reviews.csv: Provides guest feedback, which is used to derive sentiment scores.
- Property Details: Includes `property_type`, `accommodates`, `bathrooms`, `bedrooms`, and `beds`.
- Host Information: Fields like `host_id`, `host_response_rate`, `host_is_superhost`, and `host_listings_count`.
- Pricing and Occupancy: Columns such as `price`, `availability`, and `number_of_reviews`.
- Guest Sentiment: Derived from guest comments using sentiment analysis to score the emotional tone of each review.
This structure enables the model to incorporate a wide range of factors influencing listing performance in Cape Town.
Data preprocessing involved several steps to prepare the dataset for modeling:
- Data Cleaning:
  - Removed duplicate entries and irrelevant columns.
  - Addressed missing values through imputation for numerical columns and frequency encoding for categorical ones.
- Feature Engineering:
  - Created new features such as `sentiment_score`, extracted from guest reviews using sentiment analysis.
  - Encoded categorical variables, applying frequency encoding to columns like `neighbourhood_cleansed` and `property_type` for improved model performance.
- Transformations:
  - Log-transformed the `price` column to reduce skewness and approximate a normal distribution.
- Data Splitting:
  - Split the data into training and testing sets, ensuring each listing appeared only once in the analysis.
These preprocessing steps allowed for better handling of categorical data and helped optimize model performance.
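The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the toy rows, the median imputation, the use of `log1p`, the `_freq` column suffix, and the 75/25 split are all assumptions made for the example, and `price` is assumed to be numeric already.

```python
import numpy as np
import pandas as pd

# Toy stand-in for listings.csv (made-up rows; the real file has many more columns).
listings = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],  # listing 2 appears twice on purpose
    "neighbourhood_cleansed": ["Ward 115", "Ward 54", "Ward 54", "Ward 115", "Ward 115"],
    "property_type": ["Entire home", "Private room", "Private room", "Entire home", "Entire home"],
    "bedrooms": [1.0, 2.0, 2.0, np.nan, 3.0],
    "price": [800.0, 1500.0, 1500.0, 950.0, 4200.0],  # assumed already numeric
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep one row per listing.
    out = df.drop_duplicates(subset="id").copy()
    # Impute missing numeric values (median is an assumption, not the source's stated method).
    out["bedrooms"] = out["bedrooms"].fillna(out["bedrooms"].median())
    # Frequency encoding: replace each category by its share of rows.
    for col in ["neighbourhood_cleansed", "property_type"]:
        shares = out[col].value_counts(normalize=True)
        out[col + "_freq"] = out[col].map(shares)
    # Log-transform the target to reduce skew (log1p is an assumption; it tolerates zeros).
    out["log_price"] = np.log1p(out["price"])
    return out


clean = preprocess(listings)

# Simple random split; each listing lands in exactly one of the two sets.
train_set = clean.sample(frac=0.75, random_state=42)
test_set = clean.drop(train_set.index)
```

In a real pipeline the category shares would usually be computed on the training set only and then mapped onto the test set, so that no information from the test rows leaks into the features.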
The EDA focused on understanding key trends and distributions in the dataset. Below are some visual insights generated:
- Price Distribution: Showcasing the range and skewness of listing prices.
- Property Type Breakdown: An overview of listing types, such as entire homes, private rooms, etc.
- Neighborhood Popularity: Visualizing the distribution of listings across various neighborhoods.
- Occupancy Rates by Property Type: Analyzing how occupancy varies among different property types.
- Sentiment Score Distribution: Analysis of guest review sentiments, highlighting the frequency of positive and negative feedback.
These visualizations provided a foundation for understanding factors like pricing, guest sentiment, and listing characteristics that inform the recommendations and predictions.
To gain insights into guest satisfaction, sentiment analysis was performed on guest reviews. Using VADER sentiment analysis, each review was assigned a compound sentiment score ranging from -1 (most negative) to +1 (most positive).
- Sentiment Score Classification: Reviews were classified as positive (score > 0) or negative (score ≤ 0).
- Top Positive Reviews: Listings with the highest positive sentiment scores were identified to understand what guests appreciated most.
The sentiment analysis revealed a predominantly positive sentiment in reviews, indicating a high level of guest satisfaction across most listings.
This analysis helped highlight the factors contributing to positive guest experiences, crucial for optimizing listings.
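A minimal sketch of the scoring and classification step is shown below. It assumes the `vaderSentiment` package as the VADER implementation (the project may have used a different distribution of VADER), and the helper names `make_vader_scorer`, `classify`, and `summarize_by_listing` are hypothetical. The demo at the bottom uses made-up stand-in scores rather than real VADER output, so it runs without the package installed.

```python
from collections import defaultdict


def make_vader_scorer():
    """Return a text -> compound-score function backed by VADER.

    Imported lazily so the rest of this sketch runs without the package.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def classify(compound: float) -> str:
    # Rule from the text above: positive if score > 0, otherwise negative.
    return "positive" if compound > 0 else "negative"


def summarize_by_listing(reviews, scorer):
    """Aggregate per-review compound scores into per-listing statistics.

    reviews: iterable of (listing_id, comment) pairs.
    scorer:  function mapping a comment to a compound score in [-1, 1].
    """
    scores = defaultdict(list)
    for listing_id, comment in reviews:
        scores[listing_id].append(scorer(comment))
    return {
        listing_id: {
            "mean_score": sum(vals) / len(vals),
            "share_positive": sum(classify(v) == "positive" for v in vals) / len(vals),
        }
        for listing_id, vals in scores.items()
    }


# Demo with made-up scores standing in for VADER output.
stub_scores = {
    "Lovely stay, great host!": 0.86,
    "It was a flat.": 0.0,
    "Dirty and noisy.": -0.55,
}
reviews = [
    (101, "Lovely stay, great host!"),
    (101, "It was a flat."),
    (202, "Dirty and noisy."),
]
summary = summarize_by_listing(reviews, stub_scores.get)
```

To score real comments, pass `make_vader_scorer()` as the `scorer` argument instead of the stub. Note that under the `score > 0` rule, neutral reviews with a compound score of exactly 0 are counted as negative.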
To identify the most accurate model for predicting Airbnb listing prices, several models were tested. Polynomial Regression was selected as the preferred model: it captures non-linear relationships and shows the smallest gap between train and test RMSE (0.58 vs. 0.64), which suggests it generalizes well. LightGBM achieved the lowest test RMSE (0.54) and the highest R² (0.7699), but with a much larger train-test gap (0.32 vs. 0.54). The table below summarizes all the models:
Model | Train RMSE | Test RMSE | R² Score |
---|---|---|---|
Baseline Model: Linear Regression | 0.67 | 0.76 | 0.5961 |
Linear Regression (with PCA) | 0.71 | 0.79 | 0.5721 |
Decision Tree Model | 0.61 | 0.70 | 0.6412 |
Random Forest | 0.17 | 0.57 | 0.7408 |
KNN Regression Model | 0.56 | 0.68 | 0.6616 |
Tuned KNN Model (with Grid Search) | 0.54 | 0.68 | 0.6621 |
Polynomial Regression | 0.58 | 0.64 | 0.6920 |
XGBoost Model | 0.43 | 0.55 | 0.7579 |
LightGBM Model | 0.32 | 0.54 | 0.7699 |
Neural Network Model | 0.56 | 0.64 | 0.6900 |
- Metrics: Evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R².
- Performance Summary (Polynomial Regression):
  - Train RMSE: 0.58, Test RMSE: 0.64
  - Train MAE: 0.35, Test MAE: 0.37
  - R²: 0.6920
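For reference, the three metrics can be computed as in the sketch below, which also fits a degree-2 polynomial to a single feature. This is a simplified illustration, not the project's model: the toy data is noise-free and invented, the single-feature `np.polyfit` stands in for whatever multivariate polynomial regression the notebook used, and a natural-log target is assumed.

```python
import numpy as np


def rmse(y_true, y_pred):
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))


def r2(y_true, y_pred):
    y_true = np.asarray(y_true)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical toy data: log-price as a noise-free quadratic in `accommodates`.
accommodates = np.arange(1, 9, dtype=float)
log_price = 6.0 + 0.5 * accommodates - 0.03 * accommodates ** 2

# Degree-2 polynomial regression on one feature via least squares.
coeffs = np.polyfit(accommodates, log_price, deg=2)
pred_log = np.polyval(coeffs, accommodates)

fit_rmse = rmse(log_price, pred_log)
fit_r2 = r2(log_price, pred_log)

# Predictions are on the log scale, so exponentiate to get prices back.
pred_price = np.exp(pred_log)

# A log-scale RMSE of 0.64 corresponds to a multiplicative factor of exp(0.64).
error_factor = float(np.exp(0.64))
```

If the reported errors are on the natural-log price scale, as the log transform in preprocessing suggests, a test RMSE of 0.64 means a prediction that is off by one RMSE differs from the actual price by a factor of roughly 1.9, which is worth keeping in mind when turning model output into a price recommendation.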
For personalized recommendations based on user sentiment, collaborative filtering methods were employed. These models utilize sentiment scores derived from guest reviews to make relevant listing suggestions.
- SVD: Singular Value Decomposition, leveraging latent factors from review sentiment data.
- KNNBaseline: Collaborative filtering based on user-user similarity.
- BaselineOnly: A baseline model that estimates biases for users and items.
- SlopeOne: An item-based collaborative filtering model that predicts from the average rating differences between pairs of items.
The collaborative filtering models were evaluated using RMSE and MAE:
Model | RMSE | MAE |
---|---|---|
SVD | 0.4491 | 0.2419 |
KNN with Baseline | 0.4728 | 0.2594 |
BaselineOnly | 0.3762 | 0.2396 |
SlopeOne | 0.3821 | 0.2342 |
The BaselineOnly model achieved the lowest RMSE (0.3762), making it the preferred choice for sentiment-driven recommendations. SlopeOne was a close second on RMSE (0.3821) and had the lowest MAE (0.2342, versus 0.2396 for BaselineOnly).
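To make the idea behind a baseline-only predictor concrete, here is a small pure-Python sketch that estimates a global mean plus a bias per guest and per listing using alternating regularized updates. It is not the Surprise library's implementation; the function names, the regularization values, the epoch count, and the toy ratings are illustrative assumptions, as is the use of `(reviewer, listing, sentiment score)` triples as the rating data.

```python
from collections import defaultdict


def fit_baselines(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate r_hat(u, i) = mu + b_u + b_i from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u = defaultdict(float)  # guest biases, start at 0
    b_i = defaultdict(float)  # listing biases, start at 0
    for _ in range(n_epochs):
        # Update listing biases with guest biases held fixed, then the reverse.
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)


def predict(model, user, item):
    mu, b_u, b_i = model
    # Unknown guests or listings fall back to a bias of 0.
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Made-up sentiment "ratings": (reviewer_id, listing_id, compound score).
ratings = [
    ("u1", "a", 0.9), ("u1", "b", 0.2),
    ("u2", "a", 0.8), ("u2", "b", 0.1),
    ("u3", "a", 0.7),
]
model = fit_baselines(ratings)
score_a = predict(model, "u3", "a")
score_b = predict(model, "u3", "b")
cold_start = predict(model, "new_guest", "new_listing")
```

Because the prediction is the sum of a guest term and a listing term, this kind of model ranks listings in the same order for every guest; the guest bias only shifts all of that guest's predicted scores up or down. That is worth weighing against its low RMSE when the goal is personalized suggestions.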
This section explains how to use the models and data insights from this project:
- Prediction Model:
  - Run the optimal pricing model (LightGBM) to predict competitive prices for Airbnb listings.
  - Use the model output to adjust pricing based on location, property type, seasonality, and other key factors identified.
- Recommender System:
  - Use the BaselineOnly recommendation model to provide personalized listing suggestions for guests based on review sentiment.
  - Sentiment scores are calculated using the VADER sentiment analyzer, and listings with higher sentiment scores are prioritized for recommendations.
- Optimize Pricing: Adjust prices based on model recommendations to remain competitive in the market. Factors like location, amenities, and high-demand seasons should be leveraged.
- Enhance Guest Satisfaction: Sentiment scores are computed from review text, so higher scores reflect more positive reviews. Hosts should focus on consistently maintaining quality amenities and providing excellent customer service to improve these scores.
- Use Data Insights for Marketing: Promote listings that have high occupancy rates, positive sentiment scores, and competitive pricing to attract new guests.
To replicate the results:
- Download the data files `listings.csv` and `reviews.csv` from the Inside Airbnb website for Cape Town.
- Ensure all required libraries are installed.
- Execute each notebook cell step-by-step, starting from data preprocessing to model evaluation.
- Incorporate Additional Model Types:
  - Explore ensemble models or advanced neural networks to improve predictive accuracy for pricing recommendations.
  - Implement advanced recommendation algorithms, such as Matrix Factorization techniques, for a more personalized recommender system.
- Dynamic Price Optimization:
  - Develop a real-time pricing model that adapts to seasonal trends, regional events, and local demand surges.
  - Leverage time series analysis to predict occupancy rates and optimize prices dynamically based on booking patterns.
- Neighborhood Data:
  - Enrich the dataset with location-based data such as crime rates, local attractions, and proximity to transportation hubs.
  - Incorporate data on neighborhood demographics to better understand target guest preferences.
- Market Competition Insights:
  - Include data on competitive listings in the area, tracking prices, availability, and occupancy trends to provide more competitive pricing recommendations.
  - Analyze guest reviews from nearby listings to capture broader market sentiment and preferences.
- Host Activity:
  - Integrate data on host activity, including response times, availability, and cancellation rates, to assess the impact on occupancy rates and guest satisfaction.
- Guest Demographics:
  - Incorporate information on guest demographics (e.g., origin, age, travel purpose) to tailor the recommendation system more effectively to different guest profiles.
  - Analyze trends in guest demographics to identify potential new target audiences.
By incorporating these improvements, the model’s recommendations and insights can become more robust, relevant, and adaptable to changing market trends.
This project was made possible by the following resources and contributions:
- Inside Airbnb: For providing detailed Airbnb listings, reviews, and availability data used in this analysis.
- VADER Sentiment Analysis: The VADER (Valence Aware Dictionary and sEntiment Reasoner) tool for analyzing guest review sentiments.
- Surprise Library: For facilitating the implementation of collaborative filtering models in the recommendation system.
- Matplotlib, Seaborn, WordCloud: Libraries that enabled visualizations of data insights and trends in the exploratory data analysis.
A special thanks to the developers and contributors of these resources and tools for their contributions to open-source software, making projects like this possible.
This project was a collaborative effort by the following team members:
We are grateful for each other's contributions and commitment to this project.