Orchestration train #10

Open · wants to merge 43 commits into base: dev
Changes from all commits (43 commits):
3a50201
start to organize Feature engineering's part
infini11 Jun 2, 2022
17ca4ec
Some steps feature engineering and tests are written
infini11 Jun 3, 2022
14a5d84
But they need to be improved
infini11 Jun 3, 2022
59ed447
Refarctoring code step : some variables need to be updated
infini11 Jun 8, 2022
90fb761
Merge branch 'dev' of https://github.com/Aura-healthcare/seizure_dete…
infini11 Jun 8, 2022
6239ee5
Merging file train_model.py
infini11 Jun 8, 2022
c256e7b
Update train_model.py
infini11 Jun 10, 2022
ecfb651
Some refactoring about feature engineering briks
infini11 Jun 10, 2022
ce484bc
Feature eng and test feature eng. Need be improved
infini11 Jun 13, 2022
42dd73b
Refacto:
infini11 Jun 24, 2022
0bb60a5
Refacto:
infini11 Jun 24, 2022
7a291f3
Update: xgb model params
infini11 Jun 27, 2022
e441e89
Merge branch 'upcoming_sakhite' of https://github.com/Aura-healthcare…
infini11 Jun 27, 2022
dc62f50
Fix: bug from test_feature_engineering
infini11 Jun 30, 2022
46da842
add test data
infini11 Jun 30, 2022
7ce55b5
Fix: bug in feature engineering file
infini11 Jun 30, 2022
a5ce78b
Fix : file path for github
infini11 Jun 30, 2022
eb817ba
REFACT: Feature engineerning
infini11 Jul 6, 2022
927a144
Merge branch 'dev' of https://github.com/Aura-healthcare/seizure_dete…
infini11 Jul 6, 2022
0b0c1f7
Dev : Training orchestration for seizure detection
infini11 Jul 18, 2022
e6d102f
Refacto(debut): Feature eng to data_loading, data_cleaning and time_s…
infini11 Jul 19, 2022
420fb86
Refact : Time series processing
infini11 Jul 19, 2022
15cb66d
Refact : Test time series processing
infini11 Jul 20, 2022
2dc0758
Refacto: Time series processing
infini11 Jul 25, 2022
9d40b7c
Refacto: Adding constants file
infini11 Jul 25, 2022
991d6a9
Feature preparation and test feature preparation
infini11 Jul 25, 2022
30f0a57
Refacto: prepare feature pipeline
infini11 Jul 26, 2022
0845784
Refacto : ML pipeline orchestration
infini11 Jul 27, 2022
93eb305
docker-compose file
infini11 Aug 1, 2022
f78e3df
Change mlflow version
Aug 3, 2022
635457f
Merge branch 'orchestration_train' of https://github.com/Aura-healthc…
Aug 3, 2022
f6a6afc
Refacto : train pipeline
Aug 4, 2022
0a711b7
Delete docker-compose
infini11 Aug 11, 2022
a5e23fc
Change version of mlflow, config updated
Aug 19, 2022
8f73e78
Merge branch 'orchestration_train' of https://github.com/Aura-healthc…
Aug 19, 2022
5900ffe
Merge branch 'orchestration_train' of https://github.com/Aura-healthc…
Aug 19, 2022
a129505
Merge branch 'orchestration_train' of https://github.com/Aura-healthc…
Aug 19, 2022
9172281
Delete env_aura directory
infini11 Aug 23, 2022
8edd221
Fixed : MLflow dependencies conflits
infini11 Aug 24, 2022
c29334e
REFACTO : Model_params_dict
infini11 Sep 1, 2022
211a28e
REFACTO : Train pipeline for manually uses
infini11 Sep 1, 2022
85de392
Last update
infini11 Oct 6, 2022
3c6c8fd
Some config
infini11 Oct 6, 2022
2 changes: 1 addition & 1 deletion .github/workflows/github-actions-seizure-pipline.yml
@@ -46,4 +46,4 @@ jobs:
         flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
     - name: Test with pytest
       run: |
-        pytest -s -vvv ./tests
+        pytest -s -vvv ./tests --cov=src --cov-fail-under=80
2 changes: 1 addition & 1 deletion Makefile
@@ -109,7 +109,7 @@ train:

 train_ml:
 	. $(FOLDER_PATH)/env/bin/activate; \
-	python3 src/usecase/train_model.py --ml-dataset-path /home/DATA/DetecTeppe-2022-04-08/ML_ready/ML/train/df_ml_train.csv --ml-dataset-path-test /home/DATA/DetecTeppe-2022-04-08/ML_ready/ML/test/df_ml_test.csv
+	python3 src/usecase/train_model.py --ml-dataset-path /home/DATA/DetecTeppe-2022-04-08/ml_dataset_2022_04_08/train/df_ml_train.csv --ml-dataset-path-test /home/DATA/DetecTeppe-2022-04-08/ml_dataset_2022_04_08/test/df_ml_test.csv


 ## VISUALIZATION
61 changes: 61 additions & 0 deletions dags/config.py
@@ -0,0 +1,61 @@
import os
from datetime import datetime as dt, timedelta

from sklearn.ensemble import RandomForestClassifier
import numpy as np
import xgboost as xgb

PROJECT_FOLDER = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_FOLDER = os.path.join(PROJECT_FOLDER, 'data')

ML_DATASET_OUTPUT_FOLDER = '/opt/airflow/output'
AIRFLOW_PREFIX_TO_DATA = '/opt/airflow/data/'
MLRUNS_DIR = '/mlruns'

TRAIN_DATA = os.path.join(AIRFLOW_PREFIX_TO_DATA, 'train/df_ml_train.csv')
TEST_DATA = os.path.join(AIRFLOW_PREFIX_TO_DATA, 'test/df_ml_test.csv')
FEATURE_TRAIN_PATH = os.path.join(ML_DATASET_OUTPUT_FOLDER, 'ml_train.csv')
FEATURE_TEST_PATH = os.path.join(ML_DATASET_OUTPUT_FOLDER, 'ml_test.csv')

COL_TO_DROP = ['interval_index', 'interval_start_time', 'set']

START_DATE = dt(2021, 8, 1)
CONCURRENCY = 4
SCHEDULE_INTERVAL = timedelta(hours=2)
DEFAULT_ARGS = {'owner': 'airflow'}

TRACKING_URI = 'http://mlflow:5000'

MODELS_PARAM = {
    'xgboost': {
        'model': xgb.XGBClassifier(),
        'grid_parameters': {
            'nthread': [4],
            'learning_rate': [0.1, 0.01, 0.05],
            'max_depth': np.arange(3, 5, 2),
            'scale_pos_weight': [1],
            'n_estimators': np.arange(15, 25, 2),
            'missing': [-999],
        },
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'grid_parameters': {
            'min_samples_leaf': np.arange(1, 5, 1),
            'max_depth': np.arange(1, 7, 1),
            'max_features': ['auto'],
            'n_estimators': np.arange(10, 20, 2),
        },
    },
}

# Single-model alias kept for callers that expect one default configuration;
# it previously duplicated the xgboost entry verbatim.
MODEL_PARAM = MODELS_PARAM['xgboost']
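For scale: each `grid_parameters` dict above feeds an exhaustive grid search, so the number of model fits grows multiplicatively with the value lists. A quick sketch counting the xgboost candidates (the helper name and the hand-expanded `np.arange` values are illustrative, not from the repo):

```python
# The xgboost grid from dags/config.py, with the np.arange calls expanded
# by hand: np.arange(3, 5, 2) -> [3], np.arange(15, 25, 2) -> [15, 17, 19, 21, 23].
grid = {
    'nthread': [4],
    'learning_rate': [0.1, 0.01, 0.05],
    'max_depth': [3],
    'scale_pos_weight': [1],
    'n_estimators': [15, 17, 19, 21, 23],
    'missing': [-999],
}

def count_candidates(grid: dict) -> int:
    """Number of parameter combinations an exhaustive grid search will fit."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total

print(count_candidates(grid))  # 15 (3 learning rates x 5 n_estimators), per CV fold
```

Widening either `np.arange` range multiplies this count, so each extra value is paid for once per cross-validation fold.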
28 changes: 28 additions & 0 deletions dags/predict.py
@@ -0,0 +1,28 @@
import sys
from datetime import timedelta

from airflow.decorators import dag, task

sys.path.append('.')
from dags.config import DEFAULT_ARGS, START_DATE, CONCURRENCY


@dag(default_args=DEFAULT_ARGS,
     start_date=START_DATE,
     schedule_interval=timedelta(minutes=2),
     concurrency=CONCURRENCY)
def predict():
    @task
    def prepare_features_with_io_task() -> str:
        # Placeholder: will compute features and return their path.
        pass

    @task
    def predict_with_io_task(feature_path: str) -> None:
        # Placeholder: will load the model and write predictions.
        pass

    feature_path = prepare_features_with_io_task()
    predict_with_io_task(feature_path)


predict_dag = predict()
69 changes: 69 additions & 0 deletions dags/train.py
@@ -0,0 +1,69 @@
import sys

sys.path.append('.')
from src.usecase.data_processing.prepare_features import prepare_features_with_io
from src.usecase.train_model import train_pipeline_with_io
from dags.config import (
    DEFAULT_ARGS,
    START_DATE,
    CONCURRENCY,
    SCHEDULE_INTERVAL,
    MODELS_PARAM,
    MLRUNS_DIR,
    TEST_DATA,
    TRACKING_URI,
    TRAIN_DATA,
    FEATURE_TRAIN_PATH,
    FEATURE_TEST_PATH,
    COL_TO_DROP)

from airflow.decorators import dag, task


@dag(default_args=DEFAULT_ARGS,
     start_date=START_DATE,
     schedule_interval=SCHEDULE_INTERVAL,
     catchup=False,
     concurrency=CONCURRENCY)
def train_pipeline():

    @task
    def prepare_features_task(
            dataset_path: str,
            col_to_drop: list,
            feature_path: str) -> str:
        prepare_features_with_io(
            dataset_path=dataset_path,
            col_to_drop=col_to_drop,
            features_path=feature_path)
        return feature_path

    @task
    def train_model_task(
            feature_train_path: str,
            feature_test_path: str,
            tracking_uri: str = TRACKING_URI,
            model_param: dict = MODELS_PARAM['xgboost'],
            mlruns_dir: str = MLRUNS_DIR) -> None:
        train_pipeline_with_io(
            feature_train_path, feature_test_path,
            tracking_uri=tracking_uri,
            model_param=model_param,
            mlruns_dir=mlruns_dir)

    # Orchestration
    ml_train_path = prepare_features_task(
        dataset_path=TRAIN_DATA,
        col_to_drop=COL_TO_DROP,
        feature_path=FEATURE_TRAIN_PATH)

    ml_test_path = prepare_features_task(
        dataset_path=TEST_DATA,
        col_to_drop=COL_TO_DROP,
        feature_path=FEATURE_TEST_PATH)

    train_model_task(
        feature_train_path=ml_train_path,
        feature_test_path=ml_test_path,
        tracking_uri=TRACKING_URI,
        model_param=MODELS_PARAM['xgboost'],
        mlruns_dir=MLRUNS_DIR)


train_pipeline_dag = train_pipeline()
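Stripped of the Airflow decorators, the data flow of `train_pipeline` is two feature-preparation calls feeding one training call. A minimal stand-in sketch of that wiring (the stub bodies below are assumptions for illustration; the real `prepare_features_with_io` and `train_pipeline_with_io` live in `src/usecase/`):

```python
# Stand-ins for the repo's functions, showing only the wiring of the DAG.
def prepare_features_with_io(dataset_path: str, col_to_drop: list, features_path: str) -> str:
    # Real version: read dataset_path, drop col_to_drop, write features_path.
    return features_path

def train_pipeline_with_io(train_path: str, test_path: str, **kwargs) -> str:
    # Real version: grid-search a model and log runs to MLflow.
    return f"train={train_path} test={test_path}"

ml_train = prepare_features_with_io('data/train.csv', ['set'], 'output/ml_train.csv')
ml_test = prepare_features_with_io('data/test.csv', ['set'], 'output/ml_test.csv')
print(train_pipeline_with_io(ml_train, ml_test))  # train=output/ml_train.csv test=output/ml_test.csv
```

In the DAG itself the two `prepare_features_task` calls return XComs, so Airflow infers that `train_model_task` runs only after both feature files exist.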
1 change: 1 addition & 0 deletions data/PL
1 change: 1 addition & 0 deletions data/ml_dataset_2022_04_08