Team: Sagun Pandey, Ahsun Rasool, Phurpa Sherpa, Sanjay Gurung
The goal of this project is to apply the data handling and modeling skills taught in class to a real-world data set. Our task is to predict asking rents for New York City apartments posted on StreetEasy, an online marketplace for New York City homes, and to answer several modeling questions about them. Predictions will be judged on the mean squared error of our estimated rents for the provided test sets.
Important: The datasets from NYC Open Data are large and exceed GitHub's 25 MB upload limit. To replicate the whole modeling process from the start, download them to your local machine and change the import paths in the notebook. URLs to the datasets are provided in the notebook cells.
The data sets for the project come from a random selection of homes posted for rent on StreetEasy during the summer of 2018. The training set contains a sample of 12,000 homes posted in May, June, and July of 2018, along with their asking rents and several details from their StreetEasy listings, including publicly posted bedroom count, bathroom count, descriptions, and select building and unit amenities. We are required to generate predictions on a random set of listings posted on StreetEasy during August 2018. One full set, including observed rents, is provided with the project posting. We are also required to submit predicted rents on two additional sets, test2 and test3, which do not include the observed rents.
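Because predictions are judged on mean squared error and one of the August sets includes observed rents, the metric can be checked locally before submitting. The following is a minimal sketch in plain Python; the function name and the rent values are made up for illustration and are not taken from the project data.

```python
import math


def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted rents."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("need at least one observation")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Made-up rents for illustration only (not from the project data).
observed_rents = [2500.0, 3200.0, 4100.0]
predicted_rents = [2400.0, 3300.0, 4000.0]

mse = mean_squared_error(observed_rents, predicted_rents)
rmse = math.sqrt(mse)  # same units as rent (dollars), easier to interpret

print(mse, rmse)
```

The square root of the MSE is in dollars, which makes it easier to compare against typical rent levels when writing up expected performance.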
We are expected to attach at least one additional data set to the set provided. The provided data includes several fields designed to facilitate attaching third-party data sets to the StreetEasy data set, such as the street address, latitude and longitude, and New York City BIN and BBL numbers. Additional data could come from the U.S. Census Bureau, NYC Open Data, the NYC Geoclient, or any number of other open sources.
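As an illustration of attaching a third-party table by one of these keys, here is a hedged sketch of a left join on BBL using pandas. The two data frames are toy stand-ins, and every column name (`bbl`, `bedrooms`, `rent`, `year_built`) and value is an assumption for this example, not the actual schema of the StreetEasy or NYC Open Data files.

```python
import pandas as pd

# Toy stand-ins for the real tables; all column names and values are hypothetical.
listings = pd.DataFrame({
    "bbl": [1000010001, 1000020002, 3000030003],
    "bedrooms": [1, 2, 3],
    "rent": [2500, 3200, 4100],
})
building_info = pd.DataFrame({
    "bbl": [1000010001, 3000030003],
    "year_built": [1925, 2004],
})

# Left join keeps every listing, even when no external record matches.
# validate="many_to_one" raises if the external table has duplicate BBLs,
# which would otherwise silently duplicate listing rows.
merged = listings.merge(
    building_info,
    on="bbl",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Share of listings that found a match in the external table.
match_rate = (merged["_merge"] == "both").mean()
print(merged)
print(match_rate)
```

A left join is used here so the number of listings never changes; unmatched listings simply get missing values in the new columns, which then need to be imputed or handled by the model.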
- CSV with predictions against test2.csv
- A 200-300 word explanation:
  - Expected performance of the model in terms of mean squared error
  - Key features driving the team’s modeling performance
- A 200-300 word explanation:
  - Intended strategy to improve the predictions for the final round
- CSV with predictions against test3.csv