Partner: City of Cincinnati, CincyStat
The City of Cincinnati receives approximately 74,000 medical or fire-related 9-1-1 calls per year. In each of these calls, a dispatcher must make a snap judgment about whether or not one of the twelve ambulances will be needed to provide medical transport. Given the scarcity of ambulances, it is important that dispatches are as accurate as possible — sending ambulances as quickly as possible when necessary and not sending them unnecessarily.
DSSG is partnering with the City of Cincinnati to identify patterns and predictors of medical transport at the dispatch level. We hope to help EMS dispatchers send ambulances more accurately when they are needed, improving patient care in Cincinnati.
- Issues
- DSSG Page
- DSSG Poster TODO
- DSSG TECHNICAL PLAN TODO
- GovTech article
- NPR article
This project uses `pyenv`. We prefer `pyenv` to `virtualenv` since it is simpler without sacrificing any useful features. Following standard `pyenv` practice, the Python version is specified in `.python-version`; in this particular case it is 2.7.8.
Run the following:

```
make prepare
make deploy
```
See `db_profile.example` and put in your real database login credentials. Remove the `.example` suffix when finished.
See `cincinnati_ems_logging.conf_example` and modify accordingly. Rename the file by removing the `_example` suffix when finished.
See `etl/pipeline/luigi.cfg` and modify the file accordingly. Do not rename.
Make sure the following tools are installed:

- `drake`
- `mdbtools`
- `psql`
- `wget`
- `unzip`
The folder `eda` contains the bulk of the exploratory data analysis done for this project. Below we describe in general terms what each Jupyter notebook (ending in `.ipynb`) was used for.
- `Comments_Column.ipynb`: The data provided by the City of Cincinnati contains a column with comments about each incident. These comments contain information in a much more freeform format than the other fields in the data. An important use of these comments is to determine whether or not a hospital transport was required for each type of incident. This notebook begins some of these explorations, using comment data to determine the need for medical transport.
- `DD2.ipynb`: Throughout the summer, we gave two 30-minute presentations called 'Deep Dives' where we presented our methods and findings to a broader group of peers. These presentations often required customized plots and deeper data analysis. This notebook contains the exploration and model generation steps for the second of these presentations.
- `Data Exploration.ipynb`: This was one of the very first notebooks we made; it contains the code for and answers to the questions outlined at the start of the notebook.
- `Dictionary Updates.ipynb`: This notebook contains the results of trying to construct a static dictionary of dispatch codes and responses for the City of Cincinnati. Currently, the City of Cincinnati uses a fixed dictionary that maps each type of incident to the appropriate vehicle response for that incident. This notebook attempts to use past histories of transports to create a more efficient dictionary.
- `HeatMaps.ipynb`: This notebook contains self-contained code for creating heat maps of citywide incident frequency. The first heat map shows incident frequency by hour of day and the second by day of month. They are useful learning tools for importing and manipulating the data.
- `Pandas Test.ipynb`: This notebook is a very simple test which shows how to connect to the database and how to read data from a table.
- `Plots_For_PartnerCall.ipynb`: Throughout the summer, we had weekly calls with our partners in Cincinnati, on many of which we presented recent findings. The code used to generate plots for one such call is stored in this notebook. The plots it generates show how busy Cincinnati's medic units were on a fixed day of the year, compared across different years, in order to find out whether there are potential differences in medic transport demand across years.
- `RegressionTest.ipynb`: This notebook contains two separate analyses. The first part is introductory analysis for predicting medical transport demand in a given part of the city over a given time frame; this was eventually transitioned into pure Python scripts. The second half contains a method for selecting the best model(s) from all models run up to that point: fix an individual k threshold on the training set at which to send medical transport for each of the three urgency levels, and then evaluate those k values on the test set.
- `TimeSeriesAnalysis.ipynb`: Some of the models considered for predicting medical transport demand were time series models (ARIMA, VAR, etc.). This notebook is the introductory analysis for building these time series models. The general structure we followed was to plot each time series in question, plot its autocorrelation functions, and then try ARMA/ARIMA models on the series.
- `Transport_Accuracy_for_Luigi_Pipeline.ipynb`: This was the introductory notebook used to test machine learning models on our data. In it, we try Logistic Regression on our data and output the gain in accuracy over random guessing.
- `code_quantification.ipynb`: This was an introduction to understanding statistics of the various EMS codes in our data. Each incident code consists of three parts: the code type, which indicates what kind of incident it is; the code level, which indicates how severe the incident is; and the rest of the code, which often provides additional information.
- `data_stories_for_deep_dive.ipynb`: This was the code used for our first Deep Dive presentation. In that presentation, we tracked the movements of two medical transport units throughout a given day to understand potential losses in efficiency and to better understand our data.
The following sections give a walkthrough of how to modify the pipeline (adding features, changing the models used, etc.) for the medical transport binary prediction model: the problem of predicting whether or not medical transport will be required for a particular incident.
To add a new feature to the pipeline, start by opening `model_specification.yaml` and follow the format of the existing features (an illustrative entry is sketched after this list):
- `name` is the name of your feature
- `type` is the data type of your feature (`cat` = categorical)
- `deps` is a full list of all columns in the semantic table that are needed to compute your feature
- `incl` is a boolean flag indicating whether or not to include this feature as a column when generating the feature table
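For illustration, a new entry might look like the following sketch (the feature name and dependency column are hypothetical, and the exact layout should follow the entries already in the file):

```yaml
- name: example          # hypothetical feature name
  type: cat              # categorical
  deps:
    - incident_datetime  # hypothetical semantic-table column needed to compute the feature
  incl: true             # include this feature in the generated feature table
```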
Now that you have indicated the presence of your new feature, you will need to write the code to compute this feature.
Open `etl/pipeline/features.py` and create a new function with the following naming convention: if, in the `model_specification.yaml` file, you called your feature `example`, then name your function `example_feature`, so that the function definition line looks as follows:

```python
def example_feature(df):
```
Here, `df` is a dataframe from the semantic table that contains at least the columns you indicated as dependencies in the `model_specification.yaml` file. The function should return a `pandas` series containing information about your feature; this series will later be appended to the rest of the feature table.
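As a minimal sketch (the semantic-table column name `incident_datetime` is hypothetical; use whichever columns you listed under `deps`), the corresponding function could look like:

```python
import pandas as pd

def example_feature(df):
    # df is a slice of the semantic table containing at least the columns
    # listed under `deps` for this feature in model_specification.yaml.
    # Here we derive the hour of day from a hypothetical timestamp column.
    return pd.to_datetime(df['incident_datetime']).dt.hour
```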
Navigate to `etl/pipeline/luigi.cfg` and under the `[SampleData]` task modify the following parameters (an illustrative block follows this list):
- `initial_train_year` is the first year in the training set
- `train_set_numyear` is the number of years in the training set
- `initial_test_year` is the first year in the testing set
- `test_set_numyear` is the number of years in the testing set
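For illustration only (the year values are hypothetical, and the key/value syntax should match what is already in `luigi.cfg`), a `[SampleData]` block training on 2013-2014 and testing on 2015 might look like:

```ini
[SampleData]
initial_train_year = 2013
train_set_numyear = 2
initial_test_year = 2015
test_set_numyear = 1
```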
Navigate to `etl/pipeline/luigi.cfg` and under the `[RunModels]` task modify the following parameter:
- `models_used` is a list of shortnames for the various machine learning models you can use.

For a full list of such models and their shortnames, navigate to `etl/pipeline/run_models.py`. Some common shortnames are `RF` for Random Forest and `LR` for Logistic Regression.
Navigate to `etl/pipeline/run_models.py` and find a dictionary variable called `grid`. This contains all the parameters that will be used for each classifier. Note that setting n values for one parameter and m values for another parameter will result in n × m models being run, since the grid search automatically runs a model for each possible combination of parameters. Please see the scikit-learn documentation for which parameters you can use for which models. Note also that you can check the scikit-learn library of classifiers to add a completely new classifier.
Navigate to `etl/pipeline/luigi.cfg` and under the `[RunModels]` task modify the following parameter:

- `features` is a list of features to use in each of the models in `models_used`.
Note that if you want to include a family of features which all start with a common prefix, you need only include that prefix in the `features` parameter list. For example, if you have three features `time_day`, `time_month`, and `time_year`, you need only write `time_` in `features` to include all three.
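As a hypothetical illustration (the exact list syntax is an assumption; mirror the format already used in `luigi.cfg`), a `[RunModels]` block selecting Random Forest and Logistic Regression and all `time_` features might look like:

```ini
[RunModels]
models_used = RF LR
features = time_
```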
Navigate to `etl/pipeline/luigi.cfg` and under the `[RunModels]` task modify the following parameter:

- `code_buckets` is a list of incident code types grouped by the urgency level they fall into.

To modify this parameter, simply move incident code numbers between urgency lists. The particular list that an incident falls into dictates the weighted score it receives during model evaluation.
Navigate to `etl/pipeline/luigi.cfg` and under the `[RunML]` task modify the following parameter:

- `code_weights` is a list of weights to use when performing model evaluation.

Each urgency level contains a list of four numbers `A B C D`, which represent the following (a worked example follows the list):
- `A`: Reward for a true positive
- `B`: Penalty for a false negative
- `C`: Penalty for a false positive
- `D`: Reward for a true negative
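For example, illustrative weights of `2 3 1 1` for the most urgent bucket would reward each true positive with 2, penalize each false negative with 3, penalize each false positive with 1, and reward each true negative with 1 when computing a model's weighted score.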
The results of all models that successfully run are stored in the schema `model` in the table `lookup`, which should contain the following columns:
- `TIMESTAMP` is an indication of when the model finished
- `ID` is the full name of the model, including parameters
- `JSON` contains relevant information about the model and will be expanded upon below
- `PICKLE` contains the scores assigned to the training set and testing set and will be expanded upon below
Each `json` file in the `JSON` column contains the following key model metadata:
- `BUCKET NAME` is the incident urgency level that this model was run on.
- `MODEL METRICS` contains information about the performance of the model at the top k% of predictions, where k is 5, 30, 60, or 90. At each level of k we record four items: precision at k, recall at k, AUC, and weighted score at k. Weighted score at k is defined as the linear combination of model prediction outcomes (true positives, false negatives, false positives, true negatives) with the weights assigned in the `luigi.cfg` file.
- `BUCKET CODES` contains the incident code types that fall into the bucket this model is run on.
- `MODEL FEATS & IMPS` is a list of pairs containing each feature used by this model and its associated feature importance. Note that categorical features are split up into multiple binary features, each of which gets its own feature importance. For example, if the feature table contained a categorical feature such as `day_of_week` taking integer values 0-6 inclusive, it will be represented as six binary features: `day_of_week_is_Monday`, ..., `day_of_week_is_Saturday`. We must leave one binary variable out to avoid multicollinearity issues when doing our modeling.
- `LABEL` is the label that we tried to predict using this model
- `MODEL TIME` is the time that this model completed
- `WEIGHTS` is the list of weights indicated by the `luigi.cfg` file
- `MODEL SPEC` is the full name of the model, including parameters
The `PICKLE` column contains a path to the pickled dataframe of results. Each `pickle` file contains a dictionary with keys for the training and test sets; the values are the dataframes of scores assigned by the model to the training and test sets. These dataframes can be used to calculate precision, recall, or weighted score at arbitrary thresholds rather than just those prescribed in the `JSON` file.
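As a rough sketch (the dictionary keys and column names below are assumptions; inspect an actual pickle file to confirm them), the stored scores can be reloaded and evaluated at an arbitrary threshold like this:

```python
import pickle

# The path comes from the PICKLE column of model.lookup; the key ('test') and
# columns ('score', 'label') are assumptions for illustration.
with open('/path/to/model_results.p', 'rb') as f:
    results = pickle.load(f)

test = results['test']                  # scores assigned to the test set
k = 0.10                                # evaluate the top 10% of predictions
cutoff = test['score'].quantile(1 - k)
top_k = test[test['score'] >= cutoff]
precision_at_k = top_k['label'].mean()  # fraction of top-k incidents that truly needed transport
```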
The following sections give a walkthrough of how to modify the pipeline (adding features, changing the models used, etc.) for the medical transport demand prediction model: the problem of predicting how many medical transports there will be in a given geography in a specified time period.
This is a mirror of the Adding Features section of the Medical Transport Binary Prediction problem, except for a few changes. Use the `model_specification_demand.yaml` file when indicating which features should be included in the `features_demand.master` table, and use the columns of the `semantic_demand.master` table when considering which dependencies to include.
This process is identical to its counterpart in the Medical Transport Binary Prediction problem.
Navigate to `etl/pipeline/luigi.cfg` and under the `[SampleDataReg]` task modify the following parameters:
- `initial_train_year` is the year at which the training set begins
- `days_to_predict` is the number of days we would like to predict; these are automatically taken as the last days in the entire dataframe
For example, if our data consists of the full years 2013, 2014, and 2015, and we set `initial_train_year` to 2013 and `days_to_predict` to 7, our testing set will be the final 7 days of 2015 and our training set will be 2013, 2014, and all of 2015 save the last 7 days.
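The `[SampleDataReg]` block for that example (illustrative only; match the existing file's syntax) might read:

```ini
[SampleDataReg]
initial_train_year = 2013
days_to_predict = 7
```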
This is identical to its counterpart in the Medical Transport Binary Prediction section, except for a few changes. The file `etl/pipeline/run_reg_models.py` contains the names and shortnames of all the regression models used. You can change the models used in the `etl/pipeline/luigi.cfg` file under the task `[RunModelsReg]`.
Some common shortnames are `LINEAR` for Linear Regression and `LASSO` for Lasso Regression.
Navigate to `etl/pipeline/run_models_reg.py` and find a dictionary variable called `grid`. This contains all the parameters that will be used for each regression model. Note that setting n values for one parameter and m values for another parameter will result in n × m models being run, since the grid search automatically runs a model for each possible combination of parameters. Please see the scikit-learn documentation for which parameters you can use for which models. Note also that you can check the scikit-learn library of regressors to add a completely new regressor.
This is identical to its counterpart in the Medical Transport Binary Prediction section, except for a few changes. The task `[RunModelsReg]` contains the features used in the models.
Navigate to `etl/pipeline/luigi.cfg`. In the task `[RunModelsReg]`, the following parameters control the metric by which the various regression models are evaluated:
- `final_weight`: The accuracy of our predictions is weighted differently based on how far into the future they are. In general, the accuracy of predictions one day out is weighted more heavily than that of predictions seven days out, for example. But we still want the last prediction to carry some weight; this parameter sets the weight of that last prediction, assuming the weight on the first prediction is 1.
- `schedule`: This parameter sets the schedule by which our weights on prediction accuracy decay over time. It currently takes one of two values, `exp` or `lin`: `exp` designates an exponentially decaying weight structure, while `lin` designates a linearly decreasing weight structure.
- `overunder`: We generally want to penalize large inaccuracies in underpredictions (predicting too few medical transports) more heavily than large inaccuracies in overpredictions (predicting too many medical transports). The exact relative weights we use are set by this parameter, which takes two numbers separated by a space (such as `1 3`). The first number is the penalty on overpredictions and the second is the penalty on underpredictions.
Using these three parameters, we build our evaluation function: a time-weighted mean absolute percentage error with variable weights on under- and overpredictions. The weight on each predicted day decays from 1 on the first day to `final_weight` on the last day according to the chosen schedule (with an exponential decay schedule the weights fall off exponentially; a linear schedule replaces the exponential term with a linear one), and each day's absolute percentage error is multiplied by the overprediction or underprediction penalty from `overunder`. The model which gives the lowest value for this evaluation metric has the lowest time-weighted mean absolute percentage error from the true demands, taking into account over- and underpredictions.
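As a rough sketch of this evaluation function (an illustration only; the pipeline's actual implementation may differ in its details), assuming predictions are ordered from the first to the last predicted day:

```python
import numpy as np

def weighted_mape(y_true, y_pred, final_weight=0.5, schedule='exp', overunder=(1, 3)):
    """Time-weighted MAPE with separate penalties for over- and underprediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    # Per-day weights decay from 1 (first predicted day) to final_weight (last day).
    if schedule == 'exp':
        day_weights = final_weight ** (np.arange(n) / max(n - 1, 1))
    else:  # 'lin'
        day_weights = np.linspace(1.0, final_weight, n)
    over_penalty, under_penalty = overunder
    # Underpredictions (predicting too few transports) get the heavier penalty.
    direction_weights = np.where(y_pred < y_true, under_penalty, over_penalty)
    return np.mean(day_weights * direction_weights * np.abs(y_true - y_pred) / y_true)
```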
The results of all models that successfully run are stored in the schema `model_demand` in the table `lookup`, which should contain the following columns:
- `TIMESTAMP` is an indication of when the model finished
- `ID` is the full name of the model, including parameters
- `JSON` contains relevant information about the model and will be expanded upon below
- `PICKLE` contains the scores assigned to the training set and testing set and will be expanded upon below
Each `json` file in the `JSON` column contains the following key model metadata:
- `MODEL METRICS` contains four numbers which give different measures of model performance. These four numbers are, in order: Mean Absolute Percentage Error, Root Mean Squared Error, Mean Absolute Error, and Time-Weighted Mean Absolute Percentage Error (the metric outlined in the previous section).
- `MODEL FEATS & IMPS` is a list of pairs containing each feature used by this model and its associated feature importance. This has the same caveats as its counterpart in the Medical Transport Binary Prediction problem.
- `LABEL` is the label that we tried to predict using this model
- `MODEL TIME` is the time that this model completed
- `STATION` is the fire station for which this model was run
- `MODEL SPEC` is the full name of the model, including parameters
The `PICKLE` column contains a path to the pickled dataframe of results. Each `pickle` file contains a dictionary with keys for the training and test sets; the values are the dataframes of scores assigned by the model to the training and test sets. These dataframes can be used to calculate various regression accuracy measures such as MAPE, RMSE, and MAE, as well as user-defined evaluation metrics.
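As with the binary problem, a short sketch (the keys and column names are assumptions) of reloading the results and computing a few of these measures:

```python
import pickle
import numpy as np

# The path comes from the PICKLE column of model_demand.lookup; the key ('test')
# and columns ('actual', 'predicted') are assumptions for illustration.
with open('/path/to/demand_results.p', 'rb') as f:
    results = pickle.load(f)

test = results['test']
y_true = test['actual'].values.astype(float)
y_pred = test['predicted'].values.astype(float)

mae  = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
mape = np.mean(np.abs(y_true - y_pred) / y_true)
```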
See the `README` in `infrastructure`.
- See the list of issues
- Move the data files to AWS S3 and use `smart_open` for transparent write/read operations