AReS Startup Guide

This repository is intended to allow readers of "AReS: An AutoML Regression Service for Data Analytics and Novel Data-centric Visualizations" [paper] to replicate the authors' results.

The website referred to in the paper runs code which is functionally identical to the AutoML code in this repository.

Setup

To set up for a run of this pipeline, you will need to download and install the necessary libraries. Please ensure you have a python >= 3.7. Determine whether you are using pip or conda (if you don't know, use the instructions for pip).

Pip users should run:

pip install -r requirements.txt

Conda users should run:

conda create --name <env> --file environment.yaml
conda activate <env>

Testing

To run AReS for yourself, you will have to set the parameters for the run in code/AutoML/main.py. This file calls code/AutoML/utils.py, which contains all of the computation. main.py is already configured for a sample run, but some of the notable parameters that can be set are

datapath

The filepath of the data that you would like to run AReS on. AReS will attempt to predict the last column using the rest of the columns. The case studies presented in the paper are located at datasets/kaggle_datasets (concrete and paris), but other examples can be found in datasets/test_datasets.
which_regressors

This dictionary determines which of sklearn's regressors will be used by AReS, indicated with 1s.
figure_lst

This is a list of the figures discussed in the paper will have their data produced. We do not provide code to duplicate our visualizations here, as the paper shows images from a web-based tool called Apache ECharts.

See the rest of the parameters in code/AutoML/main.py

Once the parameters have been set, run AReS with

python code/AutoML/main.py

Output will be produced in your working directory.

Analysis

The code for AReS produces up to four files, which are detailed below

perf_stats_<id>.csv

Columns are <regressor>~<error_metric> for each regressor and error metric specified in config parameters. Rows are the 1-n folds specified for k-fold cross validation. Each value indicates how well that model (1 of n for each regressor) performed in cross validation according to a particular metric.
perf_stats_quantity_curve_<id>.csv

Columns are <regressor>~<error_metric> for each regressor and error metric specified. Rows are 10-100, indicating what percent of the available data models in that row were trained on.
perf_stats_point_errors_<id>.csv

Column 0 is <regressor>~p<point_num>. Column 1 is Mean Absolute Percentage Error (MAPE). Each value in Column 1 indicates how well that regressor performed over the particular point, according to its MAPE score.
perf_stats_test_best_models_<id>.csv

Column 0 is the particular regressor. Columns 1 and on are metrics. Each value indicates the how well the regressor performed in prediction on the initially held-out data, according to some error metric. Only the m best models are shown here, according to the parameter set in program configuration.

All of the figures in the paper can be generated using the above generated data. AReS performance against other Kaggle submissions can be determined by

Set AReS' datapath parameter to point to a dataset downloaded from a Kaggle competition.
Run AReS to predict the target values for the downloaded dataset.
Find the best RMSE from perf_stats_test_best_models_<id>.csv for use in step 6.
Download the submission leaderboard for that competition (example) and record the file name (not path) for step 6.
Place leaderboard file under datasets/kaggle_leaderboards/.
Run each cell in code/case_studies/create_distribution.ipynb after setting ARES_RMSE and LEADERBOARD from steps 3 and 4, respectively.

Email Joshua Elms ([email protected]) for questions.

If you find this work useful, cite it using:

@article{elms2023ares,
  title={AReS: An AutoML Regression Service for Data Analytics and Novel Data-centric Visualizations},
  author={Elms, Josh and Johnson, Sam and Kalkunte Ramachandra, Madhavan and Sugasi, Keerthana and Sharma, Parichit, and Kurban, Hasan and Dalkilic, Mehmet M.},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.assets		.assets
code		code
datasets		datasets
.gitignore		.gitignore
README.md		README.md
environment.yaml		environment.yaml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AReS Startup Guide

Setup

Testing

Analysis

About

Releases

Packages

Languages

sej2020/AReS-whitepaper-supplementary-materials

Folders and files

Latest commit

History

Repository files navigation

AReS Startup Guide

Setup

Testing

Analysis

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages