ICPSR 36404 Analysis

Analysis of the ICPSR 36404 dataset using descriptive machine learning. This work was produced as our final project for the Descriptive Learning course at Universidade Federal de Minas Gerais.

Paper

The paper (in Portuguese) for this work can be found under the paper directory.

Frequent Itemset Mining

Author: Gabriel Bastos <gabriel.s.b@live.com>

First, download the delimited version of the dataset. It is a TSV file, which serves as the input for the analysis program.

Then, install the Rust stable toolchain.

Compile this project with cargo build --release. No additional build steps should be necessary.

The resulting program has the following usage:

analyzer 0.1.0
gahag <gabriel.s.b@live.com>


USAGE:
    icpsr-36404-analysis [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    distribution    load the original dataset from stdin and display the data distribution
    help            Prints this message or the help of the given subcommand(s)
    load            load the serialized matrix from stdin and run the algorithm
    run             runs the entire pipeline
    save            load the original dataset from stdin and output the serialized matrix to stdout

icpsr-36404-analysis-run 
runs the entire pipeline

USAGE:
    icpsr-36404-analysis run [FLAGS] [OPTIONS] <min_sup>

FLAGS:
    -h, --help           Prints help information
        --recidivists    whether to include only recidivists
    -V, --version        Prints version information

OPTIONS:
        --admission-type <admission_type>    include only the given admission type [possible values: parole, new, other]
        --race <race>                        include only the given race [possible values: black, white, hispanic,
                                             other]
        --sex <sex>                          include only the given sex [possible values: male, female]

ARGS:
    <min_sup>    the minimum support ratio ([0, 1.0])
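To make the min_sup argument concrete, here is a minimal Python sketch of what a support ratio is and how a minimum support threshold filters itemsets. It is a brute-force illustration only, not the algorithm the analyzer runs, and the toy records and attribute names are made up for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of itemsets whose support is >= min_sup.

    Illustrative only: the cost is exponential in the number of items.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_sup:
                result[frozenset(candidate)] = s
    return result

# Hypothetical toy records; these attribute names are assumptions,
# not the dataset's real columns.
records = [
    {"sex=male", "admission=new"},
    {"sex=male", "admission=parole"},
    {"sex=female", "admission=new"},
    {"sex=male", "admission=new"},
]
frequent = frequent_itemsets(records, min_sup=0.5)
```

With a threshold of 0.5, only itemsets present in at least half of the records are kept. Raising min_sup toward 1.0 shrinks the result, and lowering it toward 0 grows it quickly.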

Subgroup Discovery

Author: Fernanda <fernandaguimaraes28@gmail.com>

First, install the Python packages needed to run the notebook:

pip install scikit-learn
pip install pandas
pip install datetime
pip install numpy
pip install pysubgroup

Next, place the data file in the same directory as the notebook, or update the path in the notebook:

data_path = "36404-0001-Data.tsv"

That's it. Now run the notebook with Jupyter. You can also change the subgroup max_size by altering the depth parameter in the notebook's Subgroup Discovery section.
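As a rough illustration of what a depth limit does in subgroup discovery, the sketch below scores every conjunction of at most depth conditions on a toy table. It is not the notebook's code and does not use pysubgroup. The weighted relative accuracy (WRAcc) quality measure, the function names, and the toy rows are all assumptions chosen for the example.

```python
from itertools import combinations

def wracc(rows, target, description):
    """Weighted relative accuracy of a subgroup description.

    `description` is a tuple of (attribute, value) pairs that must all hold.
    """
    covered = [r for r in rows if all(r[a] == v for a, v in description)]
    if not covered:
        return 0.0
    coverage = len(covered) / len(rows)
    p_subgroup = sum(r[target] for r in covered) / len(covered)
    p_overall = sum(r[target] for r in rows) / len(rows)
    return coverage * (p_subgroup - p_overall)

def best_subgroups(rows, target, depth, top_k=3):
    """Score every conjunction of at most `depth` selectors, return the top_k."""
    selectors = sorted({(a, r[a]) for r in rows for a in r if a != target})
    scored = []
    for size in range(1, depth + 1):
        for description in combinations(selectors, size):
            # Skip descriptions that constrain the same attribute twice.
            if len({a for a, _ in description}) < size:
                continue
            scored.append((wracc(rows, target, description), description))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical toy rows; values are invented for illustration only.
rows = [
    {"sex": "male", "admission": "parole", "recidivist": True},
    {"sex": "male", "admission": "parole", "recidivist": True},
    {"sex": "male", "admission": "new", "recidivist": False},
    {"sex": "female", "admission": "parole", "recidivist": False},
    {"sex": "female", "admission": "new", "recidivist": False},
    {"sex": "female", "admission": "new", "recidivist": False},
]
shallow = best_subgroups(rows, "recidivist", depth=1, top_k=1)
deeper = best_subgroups(rows, "recidivist", depth=2, top_k=1)
```

On this toy table, depth=1 can only return single-condition subgroups. With depth=2 the search can combine two conditions and finds a more specific, higher-scoring subgroup. The search space also grows with depth, so larger values take longer to run.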

Additional work

The following Rust crates were developed in order to support this work:

dci
onehot
bitmatrix

Licence

This project is licenced under the MIT Licence.
