- (i) A 200-300 word explanation of the expected performance of the model in terms of mean squared error and the key features driving the team’s modeling performance.
The model used for the test2 data was Random Forest Regression (RFR). We expect a mean squared error of roughly 3 million. When predictions were made with RFR on the test1 data, the mean squared error came out to 3.3 million. Combining that predicted set of roughly 2,000 samples with the previous 12,000 should somewhat improve the accuracy of our predictions, so the additional data should bring the mean squared error down by a few hundred thousand. Our reasoning for using random forests was that an ensemble method would work better on data with many features. However, the bias-variance tradeoff seems to have gotten in the way, as our trees did not beat linear regression. One of the initial driving factors was that, with enough trees in the forest, the model would avoid overfitting. Another was that initial plots of the data showed a non-linear relationship; we hoped to exploit that non-linearity by dividing the feature space into smaller sub-spaces, which decision trees excel at. Outlier observations are currently just dropped, which is probably not the best way to handle them for this type of dataset and likely contributes to a higher mean squared error.
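As a rough sketch of the modeling and evaluation procedure described above (the file name, column names, and hyperparameters are placeholders, since the actual pipeline is not shown here, and missing values are assumed to already be median-imputed):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical file and column names -- the real dataset layout is not shown in this report.
data = pd.read_csv("train.csv")
X = data.drop(columns=["target"])  # assumes missing values were already median-imputed
y = data["target"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Random forest regression as described above; hyperparameter values are illustrative, not tuned.
rf = RandomForestRegressor(n_estimators=500, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Held-out mean squared error, analogous to the ~3.3 million observed on test1.
mse = mean_squared_error(y_val, rf.predict(X_val))
print(f"Validation MSE: {mse:,.0f}")
```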
- (ii) A 200-300 word summary outlining the team’s intended strategy to improve the predictions for the final round.
We can improve the predictions by implementing gradient boosting for regression and evaluating the resulting model. Gradient boosting can lead to better performance because of how it builds the ensemble: GBMs fit a sequence of weak trees, with each new tree correcting the errors of the ones before it. Each predictor is trained on the residual errors of its predecessor as labels. Some gradient boosting implementations also handle missing data natively, and boosting offers more tuning flexibility than Random Forest Regression. This should lead to better predictions, achieving a lower mean squared error than our prior work on these samples. Our team also does not intend to limit itself to gradient boosting; we plan to try and implement other regression models as well, so that we can achieve better results and performance.
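A minimal sketch of this gradient boosting plan, assuming the same feature matrix `X` and target `y` as in the earlier sketch (the hyperparameter values here are guesses, not tuned choices):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumes X and y are already loaded and imputed as in the earlier random forest sketch.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Each successive tree is fit to the residual errors of the current ensemble,
# which is the behavior described above. learning_rate and n_estimators are illustrative.
gbm = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=3,
    random_state=42,
)
gbm.fit(X_train, y_train)

mse = mean_squared_error(y_val, gbm.predict(X_val))
print(f"Gradient boosting validation MSE: {mse:,.0f}")
```

Comparing this validation score against the random forest baseline would tell us whether boosting is worth keeping for the final round.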
While our model is performing well, it is not as good as we hoped. One shortcoming could be missing data. While there is not a lot of it, even small improvements in how we handle it might help reduce our mean squared error. Currently we impute missing values with the median of the existing data. The plan is to predict the missing values instead, using either linear regression or random forests depending on how linear the existing data is, as sketched below. We will also treat outliers more carefully: they are naively being dropped at the moment (more research is needed on how best to handle these observations). Furthermore, we plan to combine the given dataset with data from the U.S. Census Bureau; with the additional data, our model will hopefully fit better (without overfitting). Another thing to explore is whether a better ensemble algorithm exists for our dataset; we will build a Gradient Boosting model to see if we can beat our current score. If that is unsuccessful, we will further refine our random forest hyperparameters, since we are only varying a few of the available ones right now. Another idea (though somewhat unrealistic) is to create new features based on the existing ones.
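As a sketch of the model-based imputation idea, assuming `data` is the training DataFrame from the earlier sketch, all features are numeric or already encoded, and `"income"` is a hypothetical column with missing values (the real column names are not shown in this report):

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical column to impute; the actual column with missing values may differ.
impute_col = "income"
feature_cols = [c for c in data.columns if c not in (impute_col, "target")]

known = data[data[impute_col].notna()]
missing = data[data[impute_col].isna()]

# Median-fill any remaining gaps in the predictor columns (consistent with our current
# median-imputation step), since RandomForestRegressor cannot accept NaNs.
medians = data[feature_cols].median()
known_X = known[feature_cols].fillna(medians)
missing_X = missing[feature_cols].fillna(medians)

# Fit on rows where the column is observed, then predict the missing entries.
# A linear regression could be swapped in if the relationship looks roughly linear.
if len(missing) > 0:
    imputer_model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
    imputer_model.fit(known_X, known[impute_col])
    data.loc[data[impute_col].isna(), impute_col] = imputer_model.predict(missing_X)
```

For the hyperparameter refinement step, a grid or randomized search (e.g., scikit-learn's GridSearchCV) over the random forest settings we are not currently varying would be a natural way to check whether further tuning is worthwhile.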