This document describes the general structure and process for the full police-eis
project.
A number of repositories are (or have been) used in the course of developing the police early intervention system. This document describes the purpose and content of each; detailed documentation is available for each of the repositories used.
For the `police-eis` repository, simply run:

```
git clone --recursive https://github.com/dssg/police-eis.git
```
For the `police-eis-private` repository, change your directory to the root level of the `police-eis` repository, and then run:

```
git clone --recursive https://github.com/dssg/police-eis-private.git
```
The `--recursive` flag is important because it makes sure to clone any submodules included in the repositories (currently this is only `pg_tools`).
1. Drake
Only required for loading the Nashville data (see below).
For package dependencies, see `requirements.txt`. NB: Luigi will not work with Anaconda because Anaconda's builds are out of date and the Python path is sometimes problematic.
3. Luigi
`pg_tools` needs to exist in the repositories when you run any of the Luigi population scripts. If you cloned the repositories with the `--recursive` flag, you should already have it. If you don't, or you're getting errors that Luigi cannot find `pg_tools`, you can try re-cloning it with the following commands:

```
cd [path to police-eis repo]/schemas/pg_tools
git submodule init
git submodule update
```
and for the `police-eis-private` repo:

```
cd [path to police-eis repo]/police-eis-private/schemas/pg_tools
git submodule init
git submodule update
```
We use Drake to transfer the raw data from the department to the ETL schema (only the MNPD data uses this system). To run this process, use the following command in the `[path to police-eis repo]/police-eis-private/schemas/etl/` directory:

```
drake -w Drakefile_[department] -s ~/.s3cfg
```
We use Luigi to move data from the ETL schema to the staging schema. Full documentation for this process (including repopulation) is available, as is additional detailed documentation on Luigi and our setup.
Much, but not all, of the ETL-to-staging process has been automated with a bash script called `run_luigi.sh` in the `police-eis/schemas/` directory.
If you would like to run this largely automated script on a staging-development schema, the schema can be specified in the first line of the file:

```
export schema=staging_dev
```

Change `staging_dev` to an alternate name if desired, then from the terminal run `bash run_luigi.sh`.
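As a sketch of that edit, you can rewrite the first line of the script from the shell instead of opening an editor. This assumes the first line of `run_luigi.sh` is exactly an `export schema=...` statement, as described above; the schema name `staging_test` is a hypothetical example, and the script body below is a stand-in created only for illustration.

```shell
SCRIPT=run_luigi.sh
NEW_SCHEMA=staging_test

# Stand-in for the real script (which lives under police-eis/schemas/),
# created here only so the example is self-contained.
printf 'export schema=staging_dev\n# ...rest of script...\n' > "$SCRIPT"

# Replace the schema assignment on the first line only.
sed -i "1s/^export schema=.*/export schema=${NEW_SCHEMA}/" "$SCRIPT"
head -n 1 "$SCRIPT"
```

In the real repository you would of course skip the `printf` line and run the `sed` command against the existing script.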
To generate features, set the configuration of features (and labels) in a configuration file with the same form as `example_officer_config.yaml`. Then, from the `police-eis` directory, run:

```
python -m eis.run --config [configuration file name] --labels [labels configuration file name] --buildfeatures
```
Once the features have been built, you can run all of the models with:
```
python -m eis.run --config [configuration file name] --labels [labels configuration file name]
```
The models are stored in the `results` schema. Within this schema, there are a number of tables:
- `models` contains all of the model information and configuration details.
- `evaluations` contains a number of metrics for each model; each row is one metric for one model. The `model_id` is a foreign key to `model_id` in the `models` table.
- `predictions` contains the predictions for each unit (officer or dispatch) that was scored by each model. The `model_id` is a foreign key to `model_id` in the `models` table.
- `data` contains model pickle files that can be read back into Python if necessary.
The entire configuration file for every model run is stored in `results.models.config` as a JSON object, which can be queried directly in PostgreSQL. An example of such a query is shown here:

```
select * from results.models where config->>'fake_today' = '01May2014'
```
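To make the idea behind that query concrete without a PostgreSQL instance, here is a minimal sketch using SQLite's `json_extract()` (the SQLite analogue of Postgres's `->>` operator). The table and column names mirror the `results.models` schema described above; the second config key (`model`) is a hypothetical example.

```python
import json
import sqlite3

# In-memory stand-in for results.models: one JSON config per model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (model_id INTEGER, config TEXT)")
conn.execute(
    "INSERT INTO models VALUES (?, ?)",
    (20, json.dumps({"fake_today": "01May2014", "model": "RandomForest"})),
)

# Equivalent of: select model_id from results.models
#                where config->>'fake_today' = '01May2014'
rows = conn.execute(
    "SELECT model_id FROM models "
    "WHERE json_extract(config, '$.fake_today') = '01May2014'"
).fetchall()
print(rows)
```

Because the whole configuration travels with each model row, queries like this let you filter model runs by any setting after the fact.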
In order to use a fitted model in Python (for visualization of ROC curves, model inspection, making predictions with a fitted model, etc.), simply query the database for the pickle, and then use `pickle.loads()`:

```python
import pickle

# cur is assumed to be an open database cursor (e.g. from psycopg2)
cur.execute("SELECT pickle_blob FROM results.data WHERE model_id = 20")
result, = cur.fetchone()
data = pickle.loads(result)
```
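The serialize/deserialize round-trip itself can be sketched without a database. The dictionary below is a stand-in for a fitted model (in practice a row of `results.data` would hold a pickled scikit-learn estimator, and the feature names here are purely hypothetical):

```python
import pickle

# Stand-in for a fitted model object produced by the pipeline.
model = {"model_type": "RandomForest",
         "features": ["example_feature_a", "example_feature_b"]}

# What gets stored in results.data: the serialized bytes.
pickle_blob = pickle.dumps(model)

# What pickle.loads() returns after you fetch that blob back.
data = pickle.loads(pickle_blob)
assert data == model
```

The object returned by `pickle.loads()` is a fully reconstructed Python object, so a fitted estimator can be used immediately for inspection or prediction.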