https://www.kaggle.com/pranavbadami/nj-transit-amtrak-nec-performance?select=2018_11.csv
We are developing a tool for the NJ Transit app, specifically the Trip Planner, that lets customers know which trains might be delayed.
How could data make a difference in answering this question? Do you have a sense for the business as usual decision making?
Open data is the basis of our tool. Transit companies and agencies likely hold more detailed data, which they use to predict delays and to optimize departures internally.
We mainly use a dataset from Kaggle, which was scraped from the NJ Transit DepartureVision Real Time Train Status service. We also use weather data from the riem R package.
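As a rough sketch of how the two sources could be combined, the Python snippet below attaches a weather observation to each train record by matching on the clock hour. The field names (scheduled_time, valid, tmpf, p01i) are assumptions about how the Kaggle file and the weather download are laid out, and matching by hour is one possible join rule, not a settled design.

```python
from datetime import datetime

def hour_key(ts):
    """Truncate a timestamp to the start of its clock hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

def attach_weather(train_rows, weather_rows):
    """Left-join weather observations onto train rows by clock hour.

    Assumed (hypothetical) fields: train rows carry 'scheduled_time';
    weather rows carry 'valid' (observation time), 'tmpf' (temperature, F)
    and 'p01i' (hourly precipitation, inches).
    """
    # If several observations fall in one hour, the last one wins.
    weather_by_hour = {hour_key(w["valid"]): w for w in weather_rows}
    joined = []
    for row in train_rows:
        obs = weather_by_hour.get(hour_key(row["scheduled_time"]), {})
        # Trains with no matching observation keep None for the weather fields.
        joined.append({**row,
                       "temp_f": obs.get("tmpf"),
                       "precip_in": obs.get("p01i")})
    return joined
```

Train records without a weather match are kept rather than dropped, so missing weather can be handled explicitly at the modeling stage.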
The model is a regression with delay as the dependent variable. By using a regression, we hope to produce a more precise estimate of delay than the overall mean alone would give. We will treat anything over 10 minutes as a delay and group predicted delays into brackets (e.g. 10-20 min, 20-30 min), so that customers receive a notification matched to the expected length of the delay.
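The bracketing step could look something like the Python sketch below, which maps a predicted delay in minutes to a notification label. The 10-minute threshold and the 10-minute bracket width come from the examples above; the label wording and the handling of exact boundaries (a 20.0-minute prediction falls in 20-30 min) are assumptions.

```python
DELAY_THRESHOLD_MIN = 10  # delays are considered to start above 10 minutes
BRACKET_WIDTH_MIN = 10    # brackets such as 10-20 min, 20-30 min

def delay_bracket(predicted_delay_min):
    """Map a predicted delay (minutes) to a notification label."""
    if predicted_delay_min <= DELAY_THRESHOLD_MIN:
        # Hypothetical label for predictions at or below the threshold.
        return "no significant delay"
    # Lower edge of the bracket that contains the prediction.
    lower = int(predicted_delay_min // BRACKET_WIDTH_MIN) * BRACKET_WIDTH_MIN
    return f"{lower}-{lower + BRACKET_WIDTH_MIN} min"
```

For example, delay_bracket(15) gives "10-20 min" and delay_bracket(27.5) gives "20-30 min".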
How will you validate this model (cross-validation & goodness of fit metrics that relate to the business process)?
We will validate this model through cross-validation. Our delay data is continuous and numeric, but we will mostly report predictions as categories, one per time bracket.
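A minimal Python sketch of k-fold cross-validation for a delay regression follows. It assumes a single numeric predictor and uses mean absolute error in minutes as the held-out metric; both are placeholders chosen for illustration, since the final predictors and goodness-of-fit metrics are not yet fixed.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return y_bar - slope * x_bar, slope

def k_fold_mae(xs, ys, k=5, seed=0):
    """Mean absolute error (minutes), averaged over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        fold_errors.append(
            mean(abs(ys[i] - (intercept + slope * xs[i])) for i in held_out)
        )
    return mean(fold_errors)
```

An error expressed in minutes is easy to compare against the 10-minute bracket width: a held-out error well below 10 minutes would suggest the brackets are meaningful to riders.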
Our first stakeholders, customers, would use this data to inform their ticket purchasing decisions. If transit agencies or companies become interested in our product, it could also be used for internal predictions and management.
The app will be a plugin or pop-up on ticket purchasing sites for NJ Transit and Amtrak. For a given line and time of day, it will show the likelihood of a delay and how long the delay might be for that stop or train.
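One simple way to get the likelihood figure is the share of past trips on a line, at a given hour, that ran more than 10 minutes late. The Python sketch below computes that share from historical records. It is an empirical frequency rather than the regression model itself, and the (line, hour, delay) record layout is a hypothetical one.

```python
from collections import defaultdict

def delay_likelihood(records, threshold_min=10):
    """Share of historical trips delayed beyond the threshold, by (line, hour).

    records: iterable of (line, hour_of_day, delay_minutes) tuples
    (a hypothetical layout, not the dataset's actual schema).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [delayed trips, all trips]
    for line, hour, delay_min in records:
        counts[(line, hour)][1] += 1
        if delay_min > threshold_min:
            counts[(line, hour)][0] += 1
    return {key: delayed / total for key, (delayed, total) in counts.items()}
```

The pop-up could then look up the entry for the rider's line and departure hour and display it as a percentage.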
https://njogis-newjersey.opendata.arcgis.com/
NJ Transit rail station: https://njogis-newjersey.opendata.arcgis.com/datasets/NJTRANSIT::rail-stations-of-nj-transit/about
NJ Transit light rail station: https://njogis-newjersey.opendata.arcgis.com/datasets/NJTRANSIT::light-rail-stations-of-nj-transit/about
Amtrak Station: https://geo.dot.gov/server/rest/services/NTAD/Amtrak_Stations/MapServer/0
- PPM - the percentage of trains arriving at their destination within 5 minutes of schedule
- On Time - the percentage of trains arriving at the scheduled time at each station stop on a journey
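Both measures can be computed directly from delay values, as in the Python sketch below. It reads "within 5 minutes of schedule" as a delay of at most 5 minutes and "at the scheduled time" as a delay of zero or less; both readings, including how early arrivals are counted, are assumptions.

```python
def share_pct(flags):
    """Percentage of True values in a non-empty sequence of booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one observation")
    return 100.0 * sum(flags) / len(flags)

def ppm(destination_delays_min, tolerance_min=5.0):
    """Percent of trains reaching their destination within the tolerance."""
    return share_pct(d <= tolerance_min for d in destination_delays_min)

def on_time(stop_delays_min):
    """Percent of station-stop arrivals at or before the scheduled time."""
    return share_pct(d <= 0 for d in stop_delays_min)
```

PPM takes one delay per train (at its final stop), while On Time takes one delay per station stop, so the two functions expect inputs of different granularity.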
Lots of interesting, high-impact projects could be driven by this data:
- Robust prediction: This data could be used to derive a system-level prediction system for the NJ Transit network. Such a system could provide intelligent, targeted advance warnings of delays or cancellations for millions of riders.
- Combining datasets: Weather data and service alert data could be incorporated to look at the effect of weather events and analyze the impacts of specific kinds of service interruptions.
- Data visualization: Visualizing this data could provide robust insight into the system-level mechanics of the NJ Transit rail network, as well as more engaging reporting on NJ Transit.
For some more inspiration, you can check out Medium articles written by Michael Zhang and me with this data:
- The 5 Stages of a System Breakdown on NJ Transit
- What are the chances that NJ Transit will cause you to miss the Dinky?
- How data can help fix NJ Transit