Predicting Amsterdam house / real estate prices using Ordinary Least Squares, XGBoost, KNN, Lasso, Ridge, Polynomial, Random Forest, and Neural Network (MLP) regression (via scikit-learn)

Approach:

  • load a Pandas DataFrame containing (Dec-17) housing data retrieved by means of the following scraper, supplemented with longitude and latitude coordinates mapped to zip code (via GeoPy)
  • do some simple data exploration / visualisation
  • remove non-numeric data, NaNs, and outliers (everything above 3 x the standard deviation of y)
  • define the explanatory variables (surface, latitude, and longitude) and the dependent variable (price EUR)
  • split the data into train and test sets (+ normalise the explanatory variables where required)
  • find the optimal model parameters using scikit-learn's GridSearchCV
  • fit the model using GridSearchCV's optimal parameters
  • evaluate estimator performance by means of 5-fold 'shuffled' nested cross-validation
  • predict cross-validated estimates of y for each data point and plot them on a scatter diagram vs. true y
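
The search, nested cross-validation, and prediction steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the data is synthetic, the parameter grid is hypothetical, and Random Forest stands in for any of the estimators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import (GridSearchCV, KFold,
                                     cross_val_predict, cross_val_score)

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 200
surface = rng.uniform(30, 200, n)
latitude = rng.uniform(52.30, 52.42, n)
longitude = rng.uniform(4.80, 5.00, n)
X = np.column_stack([surface, latitude, longitude])
y = 4000 * surface + rng.normal(0, 30000, n)

# Inner loop tunes hyperparameters; outer loop estimates performance.
# shuffle=True gives the 'shuffled' folds.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

param_grid = {"max_depth": [3, 6], "n_estimators": [10, 30]}  # hypothetical grid
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=inner_cv)

# One R-squared value per outer fold (R-squared is the default regressor score).
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

# One out-of-fold prediction per data point, for a predicted-vs-true scatter plot.
y_pred = cross_val_predict(search, X, y, cv=outer_cv)

print(f"nested R-squared: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

Because the grid search is refitted inside every outer fold, the outer score is not biased by the parameter selection, which is the point of nesting.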

Packages required

Scores (5-fold nested 'shuffled' cross-validation - R-squared)
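
For reference, R-squared compares a model's squared error with that of always predicting the mean of y: 1.0 is a perfect fit and 0.0 is no better than the mean. The helper below is an illustrative sketch (the name r_squared and the toy prices are assumptions); it presumes the scores reported here follow this standard definition, which is also what scikit-learn regressors report by default.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy prices in EUR (made up for illustration).
y_true = [300000, 400000, 500000]
perfect = r_squared(y_true, y_true)            # exact predictions
baseline = r_squared(y_true, [400000] * 3)     # always predict the mean
print(perfect, baseline)
```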

1. XGBoost Regression

  • Parameters: max_depth: 5, min_child_weight: 6, gamma: 0.01, colsample_bytree: 1, subsample: 0.7
  • Score: 0.887

2. Random Forest Regression

  • Parameters: max_depth: 6, max_features: None, n_estimators: 10
  • Score: 0.839

3. Polynomial Regression

  • Parameters: degree: 2
  • Score: 0.731

4. Neural Network MLP Regression

  • Parameters: activation: relu, alpha: 0.01, hidden_layer_sizes: (10,10), learning_rate: invscaling
  • Score: 0.715

5. KNN Regression

  • Parameters: n_neighbors: 10
  • Score: 0.711

6. Ordinary Least-Squares Regression

  • Parameters: None
  • Score: 0.694

7. Ridge Regression

  • Parameters: alpha: 0.01
  • Score: 0.694

8. Lasso Regression

  • Parameters: alpha: 0.01
  • Score: 0.693
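
As a rough sketch, the scikit-learn estimators listed above could be instantiated with the reported parameters and compared like this. Several choices here are assumptions rather than the repository's code: the synthetic data, the StandardScaler steps, the target scaling around the MLP, and max_iter. XGBoost is left out because it needs the separate xgboost package.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 200
surface = rng.uniform(30, 200, n)
X = np.column_stack([surface, rng.uniform(52.30, 52.42, n), rng.uniform(4.80, 5.00, n)])
y = 4000 * surface + rng.normal(0, 30000, n)

# learning_rate="invscaling" only takes effect with solver="sgd";
# with the default solver it is accepted but unused.
mlp = MLPRegressor(activation="relu", alpha=0.01, hidden_layer_sizes=(10, 10),
                   learning_rate="invscaling", max_iter=500, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(max_depth=6, max_features=None,
                                           n_estimators=10, random_state=0),
    "Polynomial": make_pipeline(StandardScaler(), PolynomialFeatures(degree=2),
                                LinearRegression()),
    # Scaling the target as well is an assumption; MLPs train poorly on raw EUR prices.
    "MLP": TransformedTargetRegressor(regressor=make_pipeline(StandardScaler(), mlp),
                                      transformer=StandardScaler()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)),
    "OLS": LinearRegression(),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=0.01)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(model, X, y, cv=cv).mean()
          for name, model in models.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The numbers this prints come from synthetic data and are not comparable to the scores listed above.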

Sample data input (Pandas DataFrame)

   surface  rooms_new  zipcode_new  price_new   latitude  longitude
0    138.0        4.0         1060     420000  40.804672 -73.963420
1    130.0        5.0         1087     550000  52.355590   5.000561
2    116.0        5.0         1061     425000  52.373044   4.837568
3     92.0        5.0         1035     349511  52.416895   4.906767
4    127.0        4.0         1013    1050000  52.396789   4.876607
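
Using the five sample rows above, the cleaning, feature-selection, splitting, and scaling steps from the approach might look roughly like this. It is a sketch under assumptions: the outlier rule is read as "more than 3 standard deviations from the mean price", and the 80/20 split and random_state are illustrative values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "surface": [138.0, 130.0, 116.0, 92.0, 127.0],
    "rooms_new": [4.0, 5.0, 5.0, 5.0, 4.0],
    "zipcode_new": [1060, 1087, 1061, 1035, 1013],
    "price_new": [420000, 550000, 425000, 349511, 1050000],
    "latitude": [40.804672, 52.355590, 52.373044, 52.416895, 52.396789],
    "longitude": [-73.963420, 5.000561, 4.837568, 4.906767, 4.876607],
})

# Keep numeric columns only and drop rows with missing values.
df = df.select_dtypes("number").dropna()

# Assumed outlier rule: drop prices more than 3 standard deviations from the mean.
price = df["price_new"]
df = df[(price - price.mean()).abs() <= 3 * price.std()]

# Explanatory variables and dependent variable.
X = df[["surface", "latitude", "longitude"]]
y = df["price_new"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets,
# so no information from the test set leaks into the scaling.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```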

Figure: Scatter plot - Surface vs. Asking Price (EUR)

Figure: XGBoost - Predicted prices vs. True price (EUR)

Figure: Random Forest - Predicted prices vs. True price (EUR)

Figure: Polynomial - Predicted prices vs. True price (EUR)

Figure: Neural Network MLP - Predicted prices vs. True price (EUR)

Figure: KNN - Predicted prices vs. True price (EUR)

Figure: OLS - Predicted prices vs. True price (EUR)

Figure: Lasso - Predicted prices vs. True price (EUR)

Figure: Ridge - Predicted prices vs. True price (EUR)