An efficient, scalable machine learning pipeline that enables training and inference on large datasets that do not fit in memory, scaling up by using fast storage.
It builds an ML pipeline on top of existing ML libraries (IBM Snap ML, scikit-learn) and uses the AWS ML-IO library for data loading.
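For intuition, here is a minimal sketch of the out-of-core pattern this design is built around: stream the CSV from fast storage in chunks with ML-IO and update a model incrementally. This is illustrative only, not this project's API; the mlio calls follow the public ML-IO examples, and the chunk size, label layout, label values, and choice of `SGDClassifier` are all assumptions.

```python
import numpy as np
import mlio
from mlio.integ.numpy import as_numpy
from sklearn.linear_model import SGDClassifier

# Stream the CSV from storage in fixed-size chunks instead of loading it
# into memory all at once.
dataset = mlio.list_files("/path_to_dataset/epsilon.train.csv")
reader = mlio.CsvReader(mlio.DataReaderParams(dataset=dataset,
                                              batch_size=50000))

# Any estimator with partial_fit supports this incremental pattern.
model = SGDClassifier(random_state=42)
classes = np.array([-1.0, 1.0])  # assumed label set

for example in reader:
    # Each example holds one chunk; every CSV column arrives as a tensor.
    chunk = np.column_stack([as_numpy(feature) for feature in example])
    y, X = chunk[:, 0], chunk[:, 1:]  # assumes the label is in column 0
    model.partial_fit(X, y, classes=classes)
```

The epsilon demo below exercises the same idea through the project's own scripts.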
To set up a conda environment and install the package:

```bash
conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
conda config --add channels conda-forge
conda config --add channels mlio
conda create --yes -n smlp-environment python=3.7
conda activate smlp-environment
conda install --file requirements.txt --yes
python setup.py install
```
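After installation, a quick import check can confirm that the environment resolved correctly; the import names below match the packages in the Requirements list (a hypothetical smoke test, not part of the project's tooling):

```bash
python -c "import mlio, sklearn, numpy, psutil; print('environment OK')"
```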
Run the tests with:

```bash
python test/MLPipelineTester.py --ml_lib snap
```
The demo uses the epsilon dataset from the PASCAL Large Scale Learning Challenge. The following loop runs the demo over several chunk sizes:

```bash
for ch in 50000 100000 200000; do
    echo "chunk="$ch
    python examples/smlp-demo.py \
        --dataset_path /path_to_dataset/epsilon.train.csv \
        --dataset_test_path /path_to_dataset/epsilon.test.csv \
        --chunk_size $ch \
        --ml_lib snap \
        --ml_obj logloss \
        --ml_model_options objective=logloss,num_round=1,min_max_depth=4,max_max_depth=4,n_threads=40,random_state=42
    echo
done
```
Currently we support:
- ML models: Snap Booster, scikit-learn Decision Trees (see the sketch below)
- Input data format: CSV
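For illustration, here is a minimal sketch of training each supported model family on a single in-memory chunk. The `pai4sk` import path and the `BoostingMachine` keyword arguments are assumptions, inferred from the dependency list and the `--ml_model_options` string in the demo command above; they are not verified against this project's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from pai4sk import BoostingMachine  # assumed import path, see note above

# A small random chunk standing in for one chunk of a real dataset.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# scikit-learn backend: a plain Decision Tree.
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X, y)

# Snap Booster backend: keyword names mirror the --ml_model_options
# values in the demo command above (assumed, not verified).
booster = BoostingMachine(objective="logloss", num_round=1,
                          min_max_depth=4, max_max_depth=4,
                          n_threads=40, random_state=42)
booster.fit(X, y)
```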
Requirements:
- Python (>= 3.7)
- scikit-learn
- numpy
- pai4sk
- mlio-py
- psutil
This project is licensed under the Apache 2.0 License; see the LICENSE file for the full text.
Please see CONTRIBUTING for details. Note that this repository has been configured with the DCO bot.