nyctaxi

Prerequisites

  • Download the 2013 taxi data using this shell script.
    • To download the 2015 taxi data (includes both yellow and green taxi data but lacks medallion and hack license info), use this one. To load in R, use this script.
  • [This R script](https://github.com/msr-ds3/nyctaxi/blob/master/exploratory_analysis/load_one_week.R) loads the CSVs, adds necessary and convenient columns (e.g. neighborhood names), and saves the result as taxi_clean in one_week_taxi.Rdata. To use the dataframe, call load('one_week_taxi.Rdata').
  • This R script uses taxi_clean to create a dataframe called shifts_clean of drivers (hack_licenses) and their shifts (as measured by the cutoff analysis here), and a dataframe called taxi_clean_shifts with a shift number for each ride, and stores both in an Rdata file called shifts_clean.Rdata.
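The idea of assigning a shift number to each ride can be sketched as follows. This is a minimal illustration, not the repository's script: it assumes a shift ends whenever a driver is idle for longer than a cutoff, and the cutoff value (6 hours), the helper name assign_shifts, and the toy data are all hypothetical. The column names hack_license, pickup_datetime, and dropoff_datetime are assumed to match taxi_clean.

```r
# Toy rides for one driver (stand-in for taxi_clean; column names are assumptions).
rides <- data.frame(
  hack_license = "driver_a",
  pickup_datetime = as.POSIXct(
    c("2013-07-07 08:00:00", "2013-07-07 08:30:00",
      "2013-07-07 20:00:00", "2013-07-07 20:20:00"), tz = "UTC"),
  dropoff_datetime = as.POSIXct(
    c("2013-07-07 08:20:00", "2013-07-07 08:50:00",
      "2013-07-07 20:15:00", "2013-07-07 20:45:00"), tz = "UTC")
)

cutoff_hours <- 6  # hypothetical cutoff, not the value from the cutoff analysis

# Hypothetical helper: number each driver's shifts, starting a new shift
# whenever the idle time before a pickup exceeds the cutoff.
assign_shifts <- function(df, cutoff_hours) {
  df <- df[order(df$hack_license, df$pickup_datetime), ]
  n <- nrow(df)
  # Idle hours between the previous dropoff and the current pickup.
  gap <- c(Inf, as.numeric(difftime(df$pickup_datetime[-1],
                                    df$dropoff_datetime[-n],
                                    units = "hours")))
  # The first ride of each driver always starts a new shift.
  new_driver <- c(TRUE, df$hack_license[-1] != df$hack_license[-n])
  new_shift <- new_driver | gap > cutoff_hours
  # Cumulative count of shift starts within each driver gives the shift number.
  df$shift_num <- ave(as.integer(new_shift), df$hack_license, FUN = cumsum)
  df
}

rides_with_shifts <- assign_shifts(rides, cutoff_hours)
print(rides_with_shifts[, c("hack_license", "pickup_datetime", "shift_num")])
```

In the toy data the two morning rides fall in one shift and the two evening rides in another, because the idle gap between them exceeds the assumed cutoff.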

NOTE: As of 7/26, move all .Rdata files into the Rdata folder, and save all future .Rdata files to that folder.
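As a rough illustration of saving to and loading from a dedicated folder, the sketch below uses a temporary directory as a stand-in for the repository's Rdata folder and a tiny made-up dataframe in place of the real taxi_clean. The point it shows is that load() restores objects under the names they were saved with and returns those names.

```r
# Stand-in for the repository's Rdata folder (a temp directory, so nothing is overwritten).
rdata_dir <- file.path(tempdir(), "Rdata")
dir.create(rdata_dir, showWarnings = FALSE)

# Tiny made-up dataframe in place of the real taxi_clean.
taxi_clean <- data.frame(hack_license = c("A1", "B2"),
                         trip_distance = c(1.2, 3.4))

# Save into the folder, then remove the object to mimic a fresh session.
rdata_path <- file.path(rdata_dir, "one_week_taxi.Rdata")
save(taxi_clean, file = rdata_path)
rm(taxi_clean)

# load() recreates the saved objects and returns their names.
loaded_names <- load(rdata_path)
print(loaded_names)
print(head(taxi_clean))
```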

Descriptives

  • Cool figures, plots, and maps (output of some of the scripts below) are in this directory.
  • This script defines two functions: visualize_trips_by_shift, which plots the route of a taxicab driver over the course of a shift, and visualize_trips_by_day, which plots the route over a day of the week.
    • Usage: visualize_trips_by_shift(df, hacklicense, shift = NULL). df is the dataframe (usually taxi_clean, but sometimes a subset of it). hacklicense is the hack_license of the driver (usually chosen at random from df). shift is optional and takes a shift number; when omitted, all shifts are shown as a faceted plot. visualize_trips_by_day(df, hacklicense, day = NULL) works in a similar manner, except that day takes a particular day in the format "Mon", "Tue", etc.
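A possible calling pattern for the two functions is sketched below. The toy dataframe taxi_demo is made up and only demonstrates picking a random hack_license. The plotting calls are guarded so they run only after the script has been sourced and taxi_clean has been loaded, and the shift number 2 is an arbitrary example value.

```r
# Toy stand-in for taxi_clean; only the hack_license column matters for picking a driver.
taxi_demo <- data.frame(
  hack_license = c("A1", "A1", "B2", "C3"),
  stringsAsFactors = FALSE
)

set.seed(1)  # reproducible random pick
random_license <- sample(unique(taxi_demo$hack_license), 1)
print(random_license)

# These calls run only once the functions and the real taxi_clean are available.
if (exists("visualize_trips_by_shift") && exists("taxi_clean")) {
  driver <- sample(unique(taxi_clean$hack_license), 1)
  visualize_trips_by_shift(taxi_clean, driver)             # all shifts, faceted
  visualize_trips_by_shift(taxi_clean, driver, shift = 2)  # a single shift
  visualize_trips_by_day(taxi_clean, driver, day = "Mon")  # a single day of the week
}
```

Whether the functions draw the plot themselves or return a plot object is not stated here, so wrapping a call in print() may be needed if nothing appears.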

Trip-based

  • Stats for one week of taxi rides by day of week, hour of day, pickup location, and dropoff location are computed by this R script.
  • Trip-based descriptive plots (distributions of distance, time, fare, etc.) can be found here
  • Neighborhood popularity plots (in R) are here
  • Interactive popularity heatmaps by neighborhood can be created using this script
  • ggmap (non-interactive) popularity heatmaps can be created using the functions here
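To give a feel for the kind of per-day, per-hour statistics mentioned above, here is a small base R sketch. It is not the repository's script: the toy data is invented, and the column names pickup_datetime and trip_distance are assumed to match the cleaned taxi data.

```r
# Invented toy trips (stand-in for one week of taxi_clean).
trips <- data.frame(
  pickup_datetime = as.POSIXct(
    c("2013-07-07 08:15:00", "2013-07-07 08:45:00",
      "2013-07-08 17:05:00", "2013-07-08 23:30:00"), tz = "UTC"),
  trip_distance = c(1.2, 3.4, 0.8, 5.1)
)

# Derive the grouping columns from the pickup timestamp.
trips$day_of_week <- format(trips$pickup_datetime, "%a")
trips$hour <- as.integer(format(trips$pickup_datetime, "%H"))

# Number of trips per (day of week, hour of day).
trip_counts <- aggregate(trip_distance ~ day_of_week + hour,
                         data = trips, FUN = length)
names(trip_counts)[3] <- "n_trips"

# Mean trip distance per (day of week, hour of day).
mean_distance <- aggregate(trip_distance ~ day_of_week + hour,
                           data = trips, FUN = mean)
names(mean_distance)[3] <- "mean_distance"

print(merge(trip_counts, mean_distance, by = c("day_of_week", "hour")))
```

The same pattern extends to pickup or dropoff neighborhood by adding that column to the right-hand side of the formula.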

Driver-based

  • Driver-based descriptive plots (distributions of distance, time, fare, etc., by number of drivers) are here
  • Visualize shifts, and rides within them, for n random drivers by calling the visualize_rides_and_shifts() function created by this R script.

Shift-based

Predicting Efficiency

Predicting shift efficiency

  • Features to be included in the design matrix for the shifts prediction task are listed in this markdown file.
  • The design matrix can be created and saved as an Rdata file using the script here
  • Descriptive plots of each individual feature, for both regression and classification, are here
  • Some models and efficiency predictions are here
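For readers unfamiliar with design matrices in R, the sketch below shows one common way to build and save one with model.matrix. The feature names (start_hour, is_weekend), the efficiency values, and the output file name are all hypothetical; the actual features are the ones listed in the markdown file mentioned above.

```r
# Hypothetical per-shift features and target; values are invented.
shifts_demo <- data.frame(
  start_hour = c(6, 18, 7, 19),
  is_weekend = factor(c("no", "no", "yes", "yes")),
  efficiency = c(0.55, 0.62, 0.48, 0.70)
)

# model.matrix expands factors into indicator columns and adds an intercept.
design_matrix <- model.matrix(efficiency ~ start_hour + is_weekend,
                              data = shifts_demo)
print(design_matrix)

# Save to a temp location (stand-in for the Rdata folder).
save(design_matrix, file = file.path(tempdir(), "design_matrix_demo.Rdata"))
```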

Predicting driver efficiency

  • Future work: features to be included in the design matrix

Analyzing flow

Shiny apps

  • A Shiny app to visualize NYC taxi flow as a heatmap can be found here
  • A Shiny app (inspired by Todd Schneider's post) to visualize average trip times from neighborhood to neighborhood.
  • An app to see popular neighborhood destinations, and unusual neighborhoods.

Other work

De-anonymization

  • Java code that can de-anonymize medallions and hack licenses.

Games

  • Play the "predict the driver's efficiency" guessing game using this script.