kaggle_sf_crime

This code is for the Kaggle San Francisco crime challenge (https://www.kaggle.com/c/sf-crime). It contains a data loader with preprocessing and two main files: the first trains single classifiers and evaluates them using log loss (the metric used in the competition); the second (main_search.py) uses sklearn's randomized search for hyperparameter estimation.
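As a rough sketch of what a randomized hyperparameter search like the one in main_search.py looks like (the exact parameter ranges and features here are assumptions, and synthetic data stands in for the preprocessed Kaggle features):

```python
# Hedged sketch: randomized hyperparameter search with sklearn, scored by
# (negated) log loss, the competition metric. Parameter ranges are made up.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data; the real code would load and preprocess the Kaggle CSV.
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

param_dist = {
    "max_depth": randint(4, 20),        # assumed search range
    "n_estimators": randint(50, 200),   # assumed search range
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),
    param_distributions=param_dist,
    n_iter=5,                # number of sampled parameter settings
    scoring="neg_log_loss",  # sklearn maximizes, so log loss is negated
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```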

To get a feel for the data, visualization.py plots some statistics about the dataset.

The first try:

  • A Random Forest Classifier (`clf = RandomForestClassifier(max_depth=16, n_estimators=1024, n_jobs=48)`) placed 580/2335 with a log loss of 2.41519 (the number one entry scored 1.95936)

Data

I did not want to check the raw data into the repository (too big), but I also hate searching for data in the future, so I zipped the Kaggle data. Just unzip data/kaggle_data.zip and you have everything you need.

Plots

Some plots using pandas and seaborn:

Global stats

Number of Crimes per Hour of Day for each Category

Number of Crimes per Attribute for the Top 5 categories
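A sketch of how a plot like "crimes per hour of day for each category" can be built with pandas and seaborn. The column names Dates and Category match the Kaggle CSV; the few rows below are invented stand-in data:

```python
# Sketch: count crimes per hour of day and category, then plot with seaborn.
# Dates/Category are real Kaggle column names; the rows here are made up.
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Dates": pd.to_datetime(["2015-05-13 23:53", "2015-05-13 08:30",
                             "2015-05-12 23:10", "2015-05-12 08:05"]),
    "Category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "ASSAULT"],
})
df["Hour"] = df["Dates"].dt.hour
counts = df.groupby(["Category", "Hour"]).size().reset_index(name="Count")

sns.barplot(data=counts, x="Hour", y="Count", hue="Category")
plt.savefig("crimes_per_hour.png")
```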

Map for visualization

Map creation: script in utils (get_map_and_save.r): use the ggmap package of R, specify a lat/lon box, retrieve the map [0].

Two options:

  1. Save to an rds file with gray values, then use a Python script to reload and plot it [1]: mapdata = np.loadtxt("outputmap.txt")
  2. The colored map image created by ggmap (ggmapTemp.png) can be loaded in matplotlib with the extent set to the lat/lon box
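The second option can be sketched as follows. The filename ggmapTemp.png is from the text above; the bounding box values are placeholder assumptions for San Francisco, and a blank array stands in for the actual image:

```python
# Sketch of option 2: show the ggmap image with its extent set to the lat/lon
# box, so crime coordinates can be overplotted in map coordinates.
# Bounding box values are assumed placeholders; a blank array stands in for
# plt.imread("ggmapTemp.png").
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

lon_min, lon_max = -122.52, -122.36  # assumed SF lat/lon box
lat_min, lat_max = 37.70, 37.82

mapimg = np.zeros((100, 100, 3))  # stand-in for the ggmap PNG

fig, ax = plt.subplots()
ax.imshow(mapimg, extent=[lon_min, lon_max, lat_min, lat_max], aspect="auto")
# Crime points would be overplotted here, e.g. ax.scatter(df.X, df.Y, s=1)
fig.savefig("map.png")
```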

Example using the second option: map plot for a specific category with a kernel density estimate as heatmap.

[0] https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf

[1] https://www.kaggle.com/benhamner/sf-crime/saving-the-python-maps-file
