A project on exploratory data analysis.
Sebastian Thomas @ neue fische Bootcamp Data Science
(datascience at sebastianthomas dot de)
This was my first project at the neue fische Bootcamp Data Science. It was centered around exploratory data analysis techniques and simple predictive analysis using ordinary linear regression. After the bootcamp, the analysis was extended.
The instances in the data set represent house sales. The task is to describe the impact of the given features on the house sale prices and to predict these prices with machine learning methods.
We have the following key insights:
- The distribution of house sale prices is right-skewed (most sales lie at the lower end of the price range), with a median of about 0.5 million US dollars.
- Location has a big impact on the house sale price, as can be seen from the median house sale prices grouped by zipcode:
The area with the highest house sale prices is Medina (zipcode 98039), a city on the Eastside of the Seattle metropolitan area.
- There is a roughly linear relationship between the living space area and the house sale price.
- While there is a roughly linear relationship between the house condition and the house sale price, the quality of the interior (design/materials) has an exponential impact on the house sale price.
- The better the view, the higher the house sale price. Most properties don't have an extraordinary view.
- If the house is on the waterfront, the median house sale price increases by about 1 million US dollars.
- The typical error of the predictive model is about 12% (mean absolute percentage error) or $37,000 (median absolute error).
- Part 1: Data mining
- Part 2: Data cleaning
- Part 3: Feature engineering
- Part 4: Exploratory data analysis
- Part 5: Predictive analysis
- Part 6: Visualization
- try more regression algorithms
- try more ensemble methods
- try more feature selection methods
- try artificial neural networks
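
Several of the key insights above rest on simple computations: the median sale price grouped by zipcode, and the two error metrics quoted for the predictive model. The following is a minimal sketch of these computations using only the Python standard library. The function names and the toy numbers are assumptions made for illustration; they are not taken from the project's notebooks, which may well use pandas and scikit-learn instead.

```python
from collections import defaultdict
from statistics import mean, median


def median_price_by_zipcode(sales):
    """Group (zipcode, price) pairs by zipcode and return the median price per zipcode."""
    prices = defaultdict(list)
    for zipcode, price in sales:
        prices[zipcode].append(price)
    return {zipcode: median(values) for zipcode, values in prices.items()}


def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |actual - predicted| / |actual|, as a fraction (0.12 means 12%)."""
    return mean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def median_absolute_error(y_true, y_pred):
    """Median of |actual - predicted|, in the unit of the target (here: US dollars)."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up toy data, NOT taken from the actual data set.
sales = [
    ("98039", 1_800_000),
    ("98039", 2_000_000),
    ("98001", 250_000),
    ("98001", 300_000),
    ("98001", 260_000),
]
actual = [400_000, 500_000, 1_000_000]
predicted = [440_000, 450_000, 1_000_000]

print(median_price_by_zipcode(sales))
print(mean_absolute_percentage_error(actual, predicted))
print(median_absolute_error(actual, predicted))
```

Note that the two metrics answer different questions: the mean absolute percentage error is relative to each sale price, while the median absolute error is an absolute dollar amount that is robust to a few very large misses.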