- Datitos: Aprendizaje profundo
- T.P. N°2 - Aprendizaje Profundo 2021 by Datitos: Can you predict what position a soccer player plays based on the stats that FIFA keeps on him?
- anaconda / miniconda
- MariaDB / MySQL
- Apache Airflow: See airflow-systemd to install Apache Airflow as a systemd daemon.
- Optuna Dashboard: See optuna-dashboard-systemd to install Optuna Dashboard as a systemd daemon.
- Private Score: 0.90000
- Public Score: 0.90038
- File: study16-predict-2022-01-23_07-44-35.csv
To automate the complete training process (training, report generation, Kaggle file generation), an Apache Airflow DAG is provided. An Airflow DAG is a data workflow; this one runs N parallel training processes and then runs the report generation and Kaggle result generation steps.
DAG
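A minimal sketch of such a DAG, assuming Airflow 2.x; task ids, commands, and worker count are illustrative, not the project's actual DAG definition:

```python
# Hypothetical sketch of the training DAG: N parallel train workers,
# then the report and Kaggle-file steps.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datitos_train_pipeline",  # hypothetical dag id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # triggered manually
    catchup=False,
) as dag:
    # N parallel training workers (N = train_workers_count).
    workers = [
        BashOperator(
            task_id=f"train_worker_{i}",
            bash_command="python bin/train_model.py --device gpu --study study16",
        )
        for i in range(4)
    ]
    report = BashOperator(
        task_id="optimization_report",
        bash_command="python bin/optmimization_report.py --study study16",
    )
    kaggle_file = BashOperator(
        task_id="test_model",
        bash_command="python bin/test_model.py --study study16",
    )
    # Every worker must finish before the report and Kaggle steps run.
    workers >> report >> kaggle_file
```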
Required Airflow global variables:

- project_path: /path/to/datitos/project
- report_folds: 10
- report_seeds_count: 30
- train_cuda_process_memory_fraction: 0.1
- train_device: gpu / cpu
- train_folds: 5
- train_optuna_db_url: mysql://root:1234@localhost/example (database used by Optuna to persist study state)
- train_optuna_study: study16 (Optuna study name)
- train_optuna_timeout: 8000 (maximum time, in seconds, to wait for hyperparameter optimization)
- train_optuna_trials: 300
- train_workers_count: 4
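These variables can be set from the Airflow UI, with the airflow variables set CLI command, or programmatically. A minimal sketch using Airflow's Variable API (assumes an initialized Airflow installation; values mirror the examples above):

```python
# Sketch: register the required Airflow variables programmatically.
from airflow.models import Variable

Variable.set("project_path", "/path/to/datitos/project")
Variable.set("report_folds", "10")
Variable.set("report_seeds_count", "30")
Variable.set("train_cuda_process_memory_fraction", "0.1")
Variable.set("train_device", "gpu")
Variable.set("train_folds", "5")
Variable.set("train_optuna_db_url", "mysql://root:1234@localhost/example")
Variable.set("train_optuna_study", "study16")
Variable.set("train_optuna_timeout", "8000")
Variable.set("train_optuna_trials", "300")
Variable.set("train_workers_count", "4")
```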
You can run training across N workers. Each worker can be seen as a trial executor job: each job trains a model with a specific set of hyperparameters. All hyperparameter-score pairs are stored in a MariaDB database. Finally, you can load the Optuna study to get the hyperparameters with the highest score (see the sketch after the CPU example below). You can run a worker as follows:
Notes
- Each training process is a bin/train_model.py execution.
- The optimization report step runs the bin/optmimization_report.py script.
- The test model step runs the bin/test_model.py script.
- See below to understand what each script does.
GPU
$ conda activate datitos
$ python bin/train_model.py --device gpu \
--study study3 \
--cuda-process-memory-fraction 0.1 \
--folds 5 \
--trials 300 \
--db-url mysql://root:1234@localhost/example \
--timeout 5000
To run 10 workers, repeat the previous command in 10 distinct shell sessions (bash/zsh), or use the sketch below.
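If you prefer not to manage that many shells by hand, here is a minimal sketch that spawns the workers as parallel subprocesses (flags mirror the GPU example above; the worker count is illustrative):

```python
# Sketch: launch N training workers as parallel subprocesses instead of
# opening N separate shell sessions.
import subprocess

N_WORKERS = 10  # illustrative
cmd = [
    "python", "bin/train_model.py",
    "--device", "gpu",
    "--study", "study3",
    "--cuda-process-memory-fraction", "0.1",
    "--folds", "5",
    "--trials", "300",
    "--db-url", "mysql://root:1234@localhost/example",
    "--timeout", "5000",
]
procs = [subprocess.Popen(cmd) for _ in range(N_WORKERS)]
for p in procs:
    p.wait()  # block until every worker finishes or times out
```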
Workers can run on either CPU or GPU. Normally a good configuration is N GPU workers and perhaps 1 CPU worker, because CPU workers are highly CPU-intensive processes. The right mix depends on your CPU, your GPU, and the amount of GPU and system RAM. CPU workers parallelize k-fold cross validation to decrease response time; GPU workers cannot parallelize cross validation (see the sketch below).
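One common way to get that CPU-side parallelism is to run the folds in separate processes. A minimal sketch with scikit-learn's cross_val_score (illustrative only; the README does not state how the project implements it):

```python
# Sketch: k-fold cross validation parallelized across CPU cores.
# n_jobs=-1 runs one fold per available core, which is the kind of
# parallelism a CPU worker can exploit and a single-GPU worker cannot.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, n_jobs=-1)
print("Mean CV accuracy:", scores.mean())
```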
CPU
$ conda activate datitos
$ python bin/train.py --device cpu \
--study study3 \
--folds 5 \
--trials 300 \
--db-url mysql://root:1234@localhost/example \
--timeout 5000
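Once trials have been stored, you can load the study and read back the best hyperparameters. A minimal sketch using Optuna's standard API (study name and storage URL are the examples used throughout this README):

```python
# Sketch: load the persisted study and inspect the best trial.
import optuna

study = optuna.load_study(
    study_name="study3",
    storage="mysql://root:1234@localhost/example",
)
print("Best score:", study.best_value)
print("Best hyperparameters:", study.best_params)
```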
To monitor workers you can use any of the following tools:
See script help:
$ python bin/train.py --help
Usage: train.py [OPTIONS]
Options:
--device TEXT Device used to train and optimize model.
Values: gpu, cpu.
--study TEXT The study name.
--trials INTEGER Max trials count.
--timeout INTEGER Maximum time spent optimizing hyper
parameters in seconds.
--db-url TEXT Mariadb/MySQL connection url.
--cuda-process-memory-fraction FLOAT
Set max memory used per CUDA process.
Percentage expressed between 0 and 1.
--folds INTEGER Number of train dataset splits to apply
cross validation.
--help Show this message and exit.
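The --cuda-process-memory-fraction flag caps how much GPU memory each worker may allocate, which is what allows several workers to share a single GPU. A minimal sketch of that mechanism, assuming a PyTorch backend (an assumption; the README does not name the training framework):

```python
# Sketch: cap per-process GPU memory so several workers can share one card.
# Assumes PyTorch; the project's actual implementation may differ.
import torch

if torch.cuda.is_available():
    # Allow this process to allocate at most 10% of device 0's memory,
    # matching --cuda-process-memory-fraction 0.1.
    torch.cuda.set_per_process_memory_fraction(0.1, device=0)
```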
Generates plots for a specified Optuna study. Below you can see the generated plots for study16 (best accuracy):
$ conda activate datitos
$ python bin/optmimization_report.py \
--study study6 \
--db-url mysql://root:1234@localhost/example \
--device gpu \
--seeds-count 3 \
--folds 2
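Plots like these can be produced with Optuna's visualization module. A minimal sketch (the plot types shown are illustrative; the script may generate others):

```python
# Sketch: render standard Optuna plots for a stored study.
# Requires plotly (and kaleido for write_image).
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances

study = optuna.load_study(
    study_name="study6",
    storage="mysql://root:1234@localhost/example",
)
plot_optimization_history(study).write_image("optimization_history.png")
plot_param_importances(study).write_image("param_importances.png")
```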
See script help:
$ python bin/optmimization_report.py --help
Usage: optmimization_report.py [OPTIONS]
Options:
--device TEXT Device used to train and optimize model. Values: gpu,
cpu.
--study TEXT The study name.
--db-url TEXT Mariadb/MySQL connection url.
--report-path TEXT Path where save optimization plots.
--seeds-count INTEGER Seeds count used to calculate accuracy distribution.
--folds INTEGER Number of train dataset splits to apply cross
validation.
--help Show this message and exit.
This script runs N model training instances using the hyperparameters of the optimization trial with the best accuracy. It then takes the model with the highest accuracy and predicts on the Kaggle test file. Finally, it generates the Kaggle submission file to upload.
$ conda activate datitos
$ python bin/test_model.py \
--study study6 \
--db-url mysql://root:1234@localhost/example \
--device gpu
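Conceptually, the final step reduces to writing the predictions in Kaggle's submission format. A minimal sketch (ids and column names are hypothetical; check the competition's sample submission):

```python
# Sketch: write predictions to a Kaggle submission CSV.
# Ids and column names are hypothetical placeholders; real values come
# from the best model's predictions on the Kaggle test file.
import pandas as pd

ids = [1, 2, 3, 4, 5]
predicted_positions = ["GK", "DF", "MF", "FW", "MF"]

submission = pd.DataFrame({"Id": ids, "Category": predicted_positions})
submission.to_csv("study16-predict.csv", index=False)
```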
See script help:
$ python bin/test_model.py --help
Usage: test_model.py [OPTIONS]
Options:
--device TEXT Device used to train and optimize model. Values: gpu,
cpu.
--study TEXT The study name.
--db-url TEXT Mariadb/MySQL connection url.
--result-path TEXT Path where test predictions are saved.
--help Show this message and exit.