Machine learning model development for predicting house prices in California
- Dataset: California Housing Dataset (see details below)
- Models evaluated: SVR, LinearRegression, KNeighborsRegressor, SGDRegressor, BayesianRidge, DecisionTreeRegressor, GradientBoostingRegressor
- Input: 8 features - median household income, median house age, ...
- Output: house price
This dataset was obtained from the StatLib repository (Link)
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the sklearn.datasets.fetch_california_housing function.
- California Housing Dataset in Sklearn Documentation
- 20640 samples
- 8 Input Features:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
- Target: Median house value for California districts, expressed in hundreds of thousands of dollars ($100,000)
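As noted above, the dataset can be loaded with `sklearn.datasets.fetch_california_housing`; a minimal loading sketch (the first call downloads the data if it is not already cached locally):

```python
from sklearn.datasets import fetch_california_housing

# Load the dataset as a pandas DataFrame (downloads on first use)
housing = fetch_california_housing(as_frame=True)
X = housing.data    # 20640 rows x 8 feature columns
y = housing.target  # MedHouseVal, in units of $100,000

print(X.shape)           # (20640, 8)
print(list(X.columns))   # MedInc, HouseAge, AveRooms, ...
```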
- A correlation analysis between the features was carried out to identify highly correlated features, so that redundant ones could be removed.
- The analysis shows that longitude and latitude are highly correlated, so one of them could be removed from the feature list.
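The correlation check can be sketched as below. For a self-contained illustration this uses a small synthetic frame with a deliberately correlated pair; in the notebook the same logic would run on the full housing DataFrame:

```python
import numpy as np
import pandas as pd

# Illustrative feature frame (assumption: in the notebook this is the
# full California Housing DataFrame loaded via fetch_california_housing)
rng = np.random.default_rng(0)
lat = rng.uniform(32.5, 42.0, 500)
df = pd.DataFrame({
    "Latitude": lat,
    "Longitude": -1.2 * lat + rng.normal(0, 0.1, 500),  # strongly correlated with Latitude
    "MedInc": rng.lognormal(1.0, 0.5, 500),
})

# Flag feature pairs whose absolute correlation exceeds a chosen threshold
corr = df.corr()
threshold = 0.8
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)  # [('Latitude', 'Longitude', ...)] -> drop one of the pair
```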
- Multiple models were evaluated, and their R2 and MSE were compared to select the best model.
- The GradientBoostingRegressor model achieved the highest R2 and the lowest MSE among the evaluated models.
| Model | R2 | MSE |
|---|---|---|
| SVR | -0.020689 | 1.017586 |
| LinearRegression | 0.582674 | 0.416057 |
| KNeighborsRegressor | 0.136115 | 0.861259 |
| SGDRegressor | 0.001655 | 0.995310 |
| BayesianRidge | 0.582681 | 0.416051 |
| DecisionTreeRegressor | 0.585701 | 0.413039 |
| GradientBoostingRegressor | 0.772826 | 0.226484 |
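The comparison above can be sketched as a simple loop over candidate models. For brevity this runs a subset of the models on a train/test split of synthetic data (assumption: the notebook fits on the housing split, so exact scores will differ):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the housing data (8 features, like the real dataset)
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "BayesianRidge": BayesianRidge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record R2 and MSE on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_squared_error(y_test, pred))

for name, (r2, mse) in results.items():
    print(f"{name}: R2={r2:.3f}, MSE={mse:.3f}")
```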
- Model predictions compared with true values
Inference - The predictions show good agreement with the true values, which is also reflected in the score.
To evaluate the GradientBoostingRegressor model further and check for overfitting, cross-validation is performed.
- Cross-validation of the model on the complete dataset with cv = 5 shows a lower score than the fitted model.
| Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|
| 0.62413216 | 0.6943188 | 0.71206383 | 0.65481236 | 0.67672756 |
- Cross-validation of the model on the split dataset shows accuracy similar to that of the fitted model.
| Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|
| 0.78189507 | 0.78282526 | 0.78389246 | 0.80503452 | 0.80055348 |
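The 5-fold cross-validation above can be reproduced with `cross_val_score`; a minimal sketch on synthetic data (assumption: the notebook runs this on the housing features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 5-fold cross-validation; the default scoring for regressors is R^2
model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(scores)        # one R^2 score per fold
print(scores.mean())
```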
Inference - The model needs further tuning so that the scores match in both scenarios.
How to perform a basic ML model fit and evaluate the model's performance.
The code is available in a Python notebook, model.ipynb. To view the code, please click below:
- Model Exploration
- Model Optimization
- Hyperparameter Tuning
- Exploring Other Ways to Improve Model
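As a sketch of the hyperparameter tuning step, a small `GridSearchCV` over GradientBoostingRegressor parameters, shown on synthetic data (assumption: the notebook's actual grid and data split may differ):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Illustrative grid; the notebook's grid may cover different ranges
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Exhaustive grid search with 3-fold CV, scored by R^2
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(f"best CV R2: {search.best_score_:.3f}")
```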
Language: Python
Packages: scikit-learn, Matplotlib, Pandas, Seaborn
Resources used
- scikit-learn
- OpenAI. (2024). ChatGPT (GPT-3.5) [Large language model]. https://chat.openai.com
If you have any feedback or are interested in collaborating, please reach out to me on LinkedIn.