Are They What They Claim: A Comprehensive Study of Ordinary Linear Regression Among the Top Machine Learning Libraries in Python
by Sam Johnson, Josh Elms, Madhavan K R, Keerthana Sugasi, Parichit Sharma, Hasan Kurban and Mehmet M. Dalkilic
This repository was created to display supplementary materials from the above-mentioned [paper] submitted to KDD 2023. Below are the steps to replicate the authors' experiments.
To set up for a run of this pipeline, you will need to download and install the necessary libraries. Please ensure you have Python >= 3.7, then determine whether you are using pip or conda (if you are unsure, use the pip instructions).
Pip users should run:
pip install -r requirements.txt
Conda users should run:
conda create --name <env> --file environment.yaml
conda activate <env>
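
After installing, you may want to confirm that the environment resolved correctly. Below is an optional sanity-check sketch; the authoritative dependency list lives in `requirements.txt` / `environment.yaml`, and the package names in the loop are only examples, not the full list.

```python
# Optional sanity check after installation: confirm that key dependencies import
# cleanly. Adjust the package names to match requirements.txt.
import importlib

for pkg in ("numpy", "sklearn", "mxnet"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as exc:
        print(f"{pkg}: missing ({exc})")
```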
If either of these commands fails or takes more than a few hours to run, we recommend removing the `mxnet` requirement from the `requirements.txt` / `environment.yaml` file and retrying the install. You will then have to comment out any use of `mxnet` later in the pipeline (one way to do this without deleting code is sketched below). Some of this conflict is unavoidable, because each of the libraries used has its own set of dependencies.
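
A minimal sketch, assuming you dropped `mxnet` from the environment: guard the import so the rest of the pipeline still runs. `HAS_MXNET` is an illustrative name, not an identifier from the repository's scripts.

```python
# Guard the optional mxnet import so the remaining libraries can still be profiled.
try:
    import mxnet as mx
    HAS_MXNET = True
except ImportError:
    mx = None
    HAS_MXNET = False

if HAS_MXNET:
    # run the mxnet linear-regression trials here; otherwise they are skipped
    ...
```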
Both experiments are initialized, run, and analyzed together because recording runtime and recording memory usage are two very similar tasks. Details about the theory behind these experiments can be found in the paper, but we will provide steps to replicate the results on your own system.
NOTICE: The memory usage experiment relies on Memray, which does not support Windows and is "unlikely to ever support Windows", per its Supported Environments page. Accordingly, this experiment does not run on Windows machines. To run only the time profiling (and skip memory profiling), a Windows user can remove the memray import and the indented blocks in which memory profiling occurs in `complexity_exper/data/complexity_experiment.py`; one way to gate this is sketched below. Additionally, Memray works better on Linux than on Mac: although it will function on a Mac, the postprocessing notebook may throw numerous warnings/errors -- these can be ignored unless they halt program execution. Finally, the postprocessing of memory files MUST be completed on the same system that performed the experiment, due to the intricacies of Memray.
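
The following is a minimal sketch (not the authors' exact code) of how the runtime and memory measurements can share one loop while the memory half is disabled on Windows. `PROFILE_MEMORY` and `fit_model` are illustrative names, not identifiers from the repository.

```python
import sys
import time

PROFILE_MEMORY = sys.platform != "win32"  # Memray does not support Windows
if PROFILE_MEMORY:
    import memray

def timed_run(fit_model, trace_file="memory_trace.bin"):
    """Time one model fit; also capture a Memray allocation trace when supported."""
    start = time.perf_counter()
    if PROFILE_MEMORY:
        with memray.Tracker(trace_file):  # writes an allocation trace to trace_file
            fit_model()
    else:
        fit_model()  # timing only
    return time.perf_counter() - start
```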
- Run the Linpack benchmark
    - Enter the complexity experiment's data generation directory: `cd complexity_exper/data`
    - Compile the benchmark with GCC or your preferred compiler: `gcc linpack_benchmark.c -o linpack_benchmark.out`
    - Run the benchmark: `./linpack_benchmark.out`
    - Record the processor speed reported by the benchmark in the `MFLOPS` field of `complexity_exper/analysis/postprocessor.ipynb`
- Run the main experiment
    - Set the initialization parameters according to their descriptions in `complexity_experiment.py`
    - Run the experiment: `python complexity_experiment.py`
    - Move the output into `complexity_exper/analysis`: `cd ..` then `mv data/complexity_results analysis/`
- Run postprocessing
    - Open `complexity_exper/analysis/postprocessor.ipynb` and set the necessary parameters in the cell "User-Defined Parameters".
    - Run all cells sequentially. Memray processing can be quite slow on Mac, but it should finish in less than an hour.
- Run the visualization script
    - Set the path to the output (`complexity_results`, if you are in `complexity_exper/analysis`)
    - Run the script: `python visualization.py`
    - See `memory_figures` and `runtime_figures` for experimental results, or `processed_output` for exact values of the trends shown in the plots.
The results shown in the paper are under `complexity_exper/analysis`. Experimental results are included for two of Indiana University's High-Performance Computing systems (Quartz and Carbonate), as well as a MacBook Pro.
To run the circular data experiment:

- Run `circular_data_exper/data/create_data.py`
- Run `circular_data_exper/analysis/run_lin_reg.py`
- Run `circular_data_exper/analysis/aggregate_results.py`
- Your result CSVs will be in the `circular_data_exper/analysis/final_results` folder, and their accompanying images will be in `circular_data_exper/analysis/regression_pics`.
To run the experiment again, delete the `circular_data_exper/analysis/final_results`, `circular_data_exper/analysis/outputs`, `circular_data_exper/data/raw_data`, and `circular_data_exper/analysis/regression_pics` folders, as well as the `circular_data_exper/analysis/cnt_#.txt` file, then start over from `create_data.py`. A convenience sketch of this cleanup is shown below.
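
A small sketch of the cleanup described above, assuming it is run from the repository root; it removes only the folders and counter file listed in the text (the `cnt_#.txt` file is matched with a wildcard).

```python
import glob
import os
import shutil

# Remove the output folders produced by a previous run of the circular data experiment.
for folder in (
    "circular_data_exper/analysis/final_results",
    "circular_data_exper/analysis/outputs",
    "circular_data_exper/data/raw_data",
    "circular_data_exper/analysis/regression_pics",
):
    shutil.rmtree(folder, ignore_errors=True)

# Remove the counter file(s), e.g. cnt_1.txt.
for counter in glob.glob("circular_data_exper/analysis/cnt_*.txt"):
    os.remove(counter)
```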
To run the high-dimensional experiment:

- Run `high_dimensional_exper/analysis/run_datasets.py`
- Run `high_dimensional_exper/analysis/result_aggregation.py`

The results of this experiment are already stored in the `high_dimensional_exper/results` folder. The final CSV used in the paper is `high_dimensional_exper/results/MAE_linreg_comparison.csv`.
Email Sam Johnson ([email protected]) for questions.
If you find this work useful, cite it using:
@article{johnson2023ols,
title={Are They What They Claim: A Comprehensive Study of Ordinary Linear Regression Among the Top Machine Learning Libraries in Python},
author={Johnson, Sam and Elms, Josh and Kalkunte Ramachandra, Madhavan and Sugasi, Keerthana and Sharma, Parichit and Kurban, Hasan and Dalkilic, Mehmet M.},
year={2023}
}