Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify model launcher configs and add script input checks #90

Merged
merged 7 commits into from
Sep 9, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 1 addition & 25 deletions .github/workflows/publish-to-pypi.yml
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ jobs:
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/p/<package-name> # Replace <package-name> with your PyPI project name
url: https://pypi.org/p/meds-tab # Replace <package-name> with your PyPI project name
permissions:
id-token: write # IMPORTANT: mandatory for trusted publishing

Expand Down Expand Up @@ -91,27 +91,3 @@ jobs:
gh release upload
'${{ github.ref_name }}' dist/**
--repo '${{ github.repository }}'

publish-to-testpypi:
name: Publish Python 🐍 distribution 📦 to TestPyPI
needs:
- build
runs-on: ubuntu-latest

environment:
name: testpypi
url: https://test.pypi.org/p/<package-name>

permissions:
id-token: write # IMPORTANT: mandatory for trusted publishing

steps:
- name: Download all the dists
uses: actions/download-artifact@v3
with:
name: python-package-distributions
path: dist/
- name: Publish distribution 📦 to TestPyPI
uses: pypa/gh-action-pypi-publish@release/v1
with:
repository-url: https://test.pypi.org/legacy/
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ dependencies = [
"scikit-learn", "hydra-optuna-sweeper", "hydra-joblib-launcher", "ml-mixins", "meds==0.3.3", "meds-transforms==0.0.7",
]

[tool.setuptools_scm]

[project.scripts]
meds-tab-describe = "MEDS_tabular_automl.scripts.describe_codes:main"
meds-tab-tabularize-static = "MEDS_tabular_automl.scripts.tabularize_static:main"
Expand Down
2 changes: 1 addition & 1 deletion src/MEDS_tabular_automl/configs/describe_codes.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ defaults:
- default
- _self_

input_dir: ${output_cohort_dir}/data
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is MEDS_cohort_dir used for anything else? If not, do we need to specify it and input_dir separately? Can we just have one parameter, which will help avoid the confusion that comes about in the setting where you are or aren't using a resharding stage (b/c when you are using a re-sharding stage, the raw MEDS_cohort_dir is only the input to that first resharding stage)

input_dir: ${MEDS_cohort_dir}/data
# Where to store output code frequency data
output_filepath: ${output_cohort_dir}/metadata/codes.parquet

Expand Down
25 changes: 12 additions & 13 deletions src/MEDS_tabular_automl/configs/launch_autogluon.yaml
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is possible we are trying to hard to have the configs be "the same" or "similar" between tabularization and modeling. I think separating them out more would be good, because then you can not have the ambiguity of things like the "output cohort dir" meaning an output during tabularization and an input during modeling, etc. I'm not sure what exactly that would look like, but something in there I think would probably be smart.

Copy link
Collaborator Author

@Oufattole Oufattole Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I see, @teyaberg proposed we call it output_model_dir instead. Let's go with that.

Original file line number Diff line number Diff line change
Expand Up @@ -3,26 +3,25 @@ defaults:
- tabularization: default
- imputer: default
- normalization: default
- model_launcher: autogluon
- _self_

task_name: task
task_name: ???

# Task cached data dir
input_dir: ${output_cohort_dir}/${task_name}/task_cache
# Directory with task labels
input_label_dir: ${output_cohort_dir}/${task_name}/labels/
# Where to output the model and cached data
model_dir: ${output_cohort_dir}/autogluon/autogluon_${now:%Y-%m-%d_%H-%M-%S}
model_log_dir: ${model_dir}/.logs/
output_filepath: ${model_dir}

# Model parameters
model_params:
iterator:
keep_data_in_memory: True
binarize_task: True

log_dir: ${model_dir}/.logs/
log_filepath: ${log_dir}/log.txt
output_dir: ???

name: launch_autogluon

hydra:
verbose: False
job:
name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S}
Oufattole marked this conversation as resolved.
Show resolved Hide resolved
sweep:
dir: ${model_log_dir}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, my first comment here was right. I don't see where this is defined at the top level. It might be defined nested within a sub-config, but I don't think this will work in that case.

run:
dir: ${model_log_dir}
18 changes: 5 additions & 13 deletions src/MEDS_tabular_automl/configs/launch_model.yaml
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the whole hydra block for this and autogluon should be shared?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And some other params, actually...

Copy link
Collaborator Author

@Oufattole Oufattole Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this yaml can be used for all model launchers (autogluon, sklearn, and xgboost), and we can add a stage check for autogluon that makes sure users do not apply multirun which will apply the overrides for hydra/sweeper hydra/callbacks and hydra/launcher.

Original file line number Diff line number Diff line change
Expand Up @@ -2,30 +2,22 @@ defaults:
- _self_
- default
- tabularization: default
- model: xgboost # This can be changed to sgd_classifier or any other model
- imputer: default
- normalization: default
- model_launcher: xgboost
- override hydra/callbacks: evaluation_callback
- override hydra/sweeper: optuna
- override hydra/sweeper/sampler: tpe
- override hydra/launcher: joblib

task_name: task
task_name: ???

# Task cached data dir
input_dir: ${output_cohort_dir}/${task_name}/task_cache
# Directory with task labels
input_label_dir: ${output_cohort_dir}/${task_name}/labels/
# Where to output the model and cached data
model_saving:
model_dir: ${output_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
model_file_stem: model
model_file_extension: .json
delete_below_top_k: -1
model_logging:
model_log_dir: ${model_saving.model_dir}/.logs/
performance_log_stem: performance
config_log_stem: config
output_dir: ???

delete_below_top_k: -1

name: launch_model

Expand Down
44 changes: 0 additions & 44 deletions src/MEDS_tabular_automl/configs/model/knn_classifier.yaml

This file was deleted.

49 changes: 0 additions & 49 deletions src/MEDS_tabular_automl/configs/model/logistic_regression.yaml

This file was deleted.

This file was deleted.

38 changes: 0 additions & 38 deletions src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml

This file was deleted.

47 changes: 0 additions & 47 deletions src/MEDS_tabular_automl/configs/model/xgboost.yaml

This file was deleted.

3 changes: 3 additions & 0 deletions src/MEDS_tabular_automl/configs/model_launcher/autogluon.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
defaults:
- default
- _self_
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
keep_data_in_memory: True
binarize_task: True
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
defaults:
- imputer: default
- normalization: default
7 changes: 7 additions & 0 deletions src/MEDS_tabular_automl/configs/model_launcher/default.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
defaults:
- path: default
- data_processing_params: default
- data_loading_params: default
- _self_

tabularization: ${tabularization}
mmcdermott marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
hydra:
sweeper:
direction: maximize
n_trials: 250
n_jobs: 25
Loading
Loading