Machine learning model development for predicting house prices in California
- Dataset: California Housing Dataset (see details below)
- Models evaluated: SVR, LinearRegression, KNeighborsRegressor, SGDRegressor, BayesianRidge, DecisionTreeRegressor, GradientBoostingRegressor
- Input: 8 features - median household income, median house age, ...
- Output: house price
This dataset was obtained from the StatLib repository (Link)
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the sklearn.datasets.fetch_california_housing function.
- California Housing Dataset in Sklearn Documentation
- 20640 samples
- 8 Input Features:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
- Target: Median house value for California districts, expressed in hundreds of thousands of dollars ($100,000)
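As noted above, the dataset can be loaded with `sklearn.datasets.fetch_california_housing`; a minimal loading sketch (the first call downloads the data if it is not already cached locally):

```python
from sklearn.datasets import fetch_california_housing

# Load the dataset as a pandas DataFrame (downloads on first use)
housing = fetch_california_housing(as_frame=True)
X = housing.data    # 20640 rows x 8 feature columns
y = housing.target  # MedHouseVal, in units of $100,000

print(X.shape)           # (20640, 8)
print(list(X.columns))   # MedInc, HouseAge, AveRooms, ...
```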
- A correlation analysis between the features was carried out to identify highly correlated features, so that redundant ones could be removed.
- The analysis shows that longitude and latitude are highly correlated, so one of them could be removed from the feature list.
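The correlation check can be sketched as below. For a self-contained illustration this uses a small synthetic frame with a deliberately correlated pair; in the notebook the same logic would run on the full housing DataFrame:

```python
import numpy as np
import pandas as pd

# Illustrative feature frame (assumption: in the notebook this is the
# full California Housing DataFrame loaded via fetch_california_housing)
rng = np.random.default_rng(0)
lat = rng.uniform(32.5, 42.0, 500)
df = pd.DataFrame({
    "Latitude": lat,
    "Longitude": -1.2 * lat + rng.normal(0, 0.1, 500),  # strongly correlated with Latitude
    "MedInc": rng.lognormal(1.0, 0.5, 500),
})

# Flag feature pairs whose absolute correlation exceeds a chosen threshold
corr = df.corr()
threshold = 0.8
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)  # [('Latitude', 'Longitude', ...)] -> drop one of the pair
```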
- Multiple models were evaluated, and their R2 and MSE were compared to select the best model.
- The GradientBoostingRegressor model achieved the highest R2 and the lowest MSE among the evaluated models.
| Model | R2 | MSE |
|---|---|---|
| SVR | -0.020689 | 1.017586 |
| LinearRegression | 0.582674 | 0.416057 |
| KNeighborsRegressor | 0.136115 | 0.861259 |
| SGDRegressor | 0.001655 | 0.995310 |
| BayesianRidge | 0.582681 | 0.416051 |
| DecisionTreeRegressor | 0.585701 | 0.413039 |
| GradientBoostingRegressor | 0.772826 | 0.226484 |
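The comparison above can be sketched as a simple loop over candidate models. For brevity this runs a subset of the models on a train/test split of synthetic data (assumption: the notebook fits on the housing split, so exact scores will differ):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the housing data (8 features, like the real dataset)
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "BayesianRidge": BayesianRidge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record R2 and MSE on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_squared_error(y_test, pred))

for name, (r2, mse) in results.items():
    print(f"{name}: R2={r2:.3f}, MSE={mse:.3f}")
```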
- Model predictions compared with true values
Inference - The predictions show good agreement with the true values, which is also reflected in the score.
To evaluate the GradientBoostingRegressor model further and check for overfitting, cross-validation is performed.
- Cross-validation of the model on the complete dataset with cv = 5 shows a lower score than the fitted model.
| Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|
| 0.62413216 | 0.6943188 | 0.71206383 | 0.65481236 | 0.67672756 |
- Cross-validation of the model on the split dataset shows accuracy similar to that of the fitted model.
| Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|
| 0.78189507 | 0.78282526 | 0.78389246 | 0.80503452 | 0.80055348 |
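The 5-fold cross-validation above can be reproduced with `cross_val_score`; a minimal sketch on synthetic data (assumption: the notebook runs this on the housing features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 5-fold cross-validation; the default scoring for regressors is R^2
model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(scores)        # one R^2 score per fold
print(scores.mean())
```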
Inference - The model needs further tuning so that the scores match in both scenarios.
How to perform a basic ML model fit and evaluate the model's performance.
The code is available in a Python notebook, model.ipynb. To view the code, please click below:
- Model Exploration
- Model Optimization
- Hyperparameter Tuning
- Exploring Other Ways to Improve Model
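As a sketch of the hyperparameter tuning step, a small `GridSearchCV` over GradientBoostingRegressor parameters, shown on synthetic data (assumption: the notebook's actual grid and data split may differ):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Illustrative grid; the notebook's grid may cover different ranges
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Exhaustive grid search with 3-fold CV, scored by R^2
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(f"best CV R2: {search.best_score_:.3f}")
```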
Language: Python
Packages: scikit-learn, Matplotlib, Pandas, Seaborn
Resources used
- scikit-learn
- OpenAI. (2024). ChatGPT (GPT-3.5) [Large language model]. https://chat.openai.com
If you have any feedback or are interested in collaborating, please reach out to me on LinkedIn.