An efficient, scalable machine learning pipeline that enables training and inference on large datasets that do not fit in memory, scaling up by using fast storage.
It builds an ML pipeline on top of existing ML libraries (IBM Snap ML, scikit-learn) and uses the AWS ML-IO library for data loading.
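For intuition, here is a minimal sketch of the out-of-core pattern this design is built around: stream the CSV from fast storage in chunks with ML-IO and update a model incrementally. This is illustrative only, not this project's API; the mlio calls follow the public ML-IO examples, and the chunk size, label layout, label values, and choice of `SGDClassifier` are all assumptions.

```python
import numpy as np
import mlio
from mlio.integ.numpy import as_numpy
from sklearn.linear_model import SGDClassifier

# Stream the CSV from storage in fixed-size chunks instead of loading it
# into memory all at once.
dataset = mlio.list_files("/path_to_dataset/epsilon.train.csv")
reader = mlio.CsvReader(mlio.DataReaderParams(dataset=dataset,
                                              batch_size=50000))

# Any estimator with partial_fit supports this incremental pattern.
model = SGDClassifier(random_state=42)
classes = np.array([-1.0, 1.0])  # assumed label set

for example in reader:
    # Each example holds one chunk; every CSV column arrives as a tensor.
    chunk = np.column_stack([as_numpy(feature) for feature in example])
    y, X = chunk[:, 0], chunk[:, 1:]  # assumes the label is in column 0
    model.partial_fit(X, y, classes=classes)
```

The epsilon demo below exercises the same idea through the project's own scripts.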
To set up a conda environment and install the package:

```bash
conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
conda config --add channels conda-forge
conda config --add channels mlio
conda create --yes -n smlp-environment python=3.7
conda activate smlp-environment
conda install --file requirements.txt --yes
python setup.py install
```
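After installation, a quick import check can confirm that the environment resolved correctly; the import names below match the packages in the Requirements list (a hypothetical smoke test, not part of the project's tooling):

```bash
python -c "import mlio, sklearn, numpy, psutil; print('environment OK')"
```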
Run the tests with:

```bash
python test/MLPipelineTester.py --ml_lib snap
```
The demo uses the epsilon dataset from the PASCAL Large Scale Learning Challenge. The following loop runs the demo over several chunk sizes:

```bash
for ch in 50000 100000 200000; do
    echo "chunk="$ch
    python examples/smlp-demo.py \
        --dataset_path /path_to_dataset/epsilon.train.csv \
        --dataset_test_path /path_to_dataset/epsilon.test.csv \
        --chunk_size $ch \
        --ml_lib snap \
        --ml_obj logloss \
        --ml_model_options objective=logloss,num_round=1,min_max_depth=4,max_max_depth=4,n_threads=40,random_state=42
    echo
done
```
Currently we support:
- ML models: Snap Booster, scikit-learn Decision Trees (see the sketch below)
- Input data format: CSV
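For illustration, here is a minimal sketch of training each supported model family on a single in-memory chunk. The `pai4sk` import path and the `BoostingMachine` keyword arguments are assumptions, inferred from the dependency list and the `--ml_model_options` string in the demo command above; they are not verified against this project's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from pai4sk import BoostingMachine  # assumed import path, see note above

# A small random chunk standing in for one chunk of a real dataset.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# scikit-learn backend: a plain Decision Tree.
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X, y)

# Snap Booster backend: keyword names mirror the --ml_model_options
# values in the demo command above (assumed, not verified).
booster = BoostingMachine(objective="logloss", num_round=1,
                          min_max_depth=4, max_max_depth=4,
                          n_threads=40, random_state=42)
booster.fit(X, y)
```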
Requirements:
- Python (>= 3.7)
- scikit-learn
- numpy
- pai4sk
- mlio-py
- psutil
This project is licensed under the Apache 2.0 License; see the LICENSE file for the full text.
Please see CONTRIBUTING for details. Note that this repository has been configured with the DCO bot.