Simplify model launcher configs and add script input checks #90

Merged: 7 commits (Sep 9, 2024)

26 changes: 1 addition & 25 deletions .github/workflows/publish-to-pypi.yml
@@ -36,7 +36,7 @@ jobs:
runs-on: ubuntu-latest
environment:
name: pypi
-      url: https://pypi.org/p/<package-name> # Replace <package-name> with your PyPI project name
+      url: https://pypi.org/p/meds-tab # Replace <package-name> with your PyPI project name
permissions:
id-token: write # IMPORTANT: mandatory for trusted publishing

@@ -91,27 +91,3 @@ jobs:
gh release upload
'${{ github.ref_name }}' dist/**
--repo '${{ github.repository }}'

-  publish-to-testpypi:
-    name: Publish Python 🐍 distribution 📦 to TestPyPI
-    needs:
-      - build
-    runs-on: ubuntu-latest
-
-    environment:
-      name: testpypi
-      url: https://test.pypi.org/p/<package-name>
-
-    permissions:
-      id-token: write # IMPORTANT: mandatory for trusted publishing
-
-    steps:
-      - name: Download all the dists
-        uses: actions/download-artifact@v3
-        with:
-          name: python-package-distributions
-          path: dist/
-      - name: Publish distribution 📦 to TestPyPI
-        uses: pypa/gh-action-pypi-publish@release/v1
-        with:
-          repository-url: https://test.pypi.org/legacy/
36 changes: 18 additions & 18 deletions README.md
@@ -84,12 +84,12 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
# Re-shard pipeline
-# $MIMICIV_MEDS_DIR is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
+# $MIMICIV_input_dir is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
# $MEDS_TAB_COHORT_DIR is the directory where the re-sharded MEDS dataset will be stored, and where your model
# will store cached files during processing by default.
# $N_PATIENTS_PER_SHARD is the number of patients per shard you want to use.
MEDS_transform-reshard_to_split \
input_dir="$MIMICIV_MEDS_DIR" \
input_dir="$MIMICIV_input_dir" \
cohort_dir="$MEDS_TAB_COHORT_DIR" \
'stages=["reshard_to_split"]' \
stage="reshard_to_split" \
@@ -103,14 +103,14 @@ By following these steps, you can seamlessly transform your dataset, define nece
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.
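
**Example: Describing codes.** A minimal invocation sketch (the paths are hypothetical; the argument names follow the `input_dir`/`output_dir` scheme this PR introduces):

```console
meds-tab-describe \
    input_dir="path_to_data" \
    output_dir="output_directory"
```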

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with a minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`:

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -127,19 +127,19 @@ By following these steps, you can seamlessly transform your dataset, define nece
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.

**Example: Aligning tabularized data** for a specific task `$TASK`, with labels pulled from [ACES](https://github.com/justin13601/ACES):

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -151,7 +151,7 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
@@ -436,7 +436,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -506,7 +506,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \
@@ -529,14 +529,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-  +model_params.model.eta: tag(log, interval(0.001, 1))
-  +model_params.model.lambda: tag(log, interval(0.001, 1))
-  +model_params.model.alpha: tag(log, interval(0.001, 1))
-  +model_params.model.subsample: interval(0.5, 1)
-  +model_params.model.min_child_weight: interval(1e-2, 100)
-  +model_params.model.max_depth: range(2, 16)
-  model_params.num_boost_round: range(100, 1000)
-  model_params.early_stopping_rounds: range(1, 10)
+  model.eta: tag(log, interval(0.001, 1))
+  model.lambda: tag(log, interval(0.001, 1))
+  model.alpha: tag(log, interval(0.001, 1))
+  model.subsample: interval(0.5, 1)
+  model.min_child_weight: interval(1e-2, 100)
+  model.max_depth: range(2, 16)
+  num_boost_round: range(100, 1000)
+  early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```
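
Since these entries use Hydra's override sweep grammar and are consumed by the Optuna sweeper, a one-off sweep can presumably also narrow individual ranges directly on the command line (the values below are hypothetical):

```console
meds-tab-xgboost --multirun \
    input_dir="path_to_data" \
    task_name=$TASK \
    output_dir="output_directory" \
    'model.eta=tag(log, interval(0.01, 0.3))' \
    'model.max_depth=range(4, 8)'
```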

12 changes: 6 additions & 6 deletions docs/source/overview.md
@@ -38,14 +38,14 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with a minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`:

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -62,19 +62,19 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.

**Example: Aligning tabularized data** for a specific task `$TASK`, with labels pulled from [ACES](https://github.com/justin13601/ACES):

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -86,7 +86,7 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
20 changes: 10 additions & 10 deletions docs/source/prediction.md
@@ -14,7 +14,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -84,7 +84,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-permutations [1d,30d,365d,full]) \
@@ -107,14 +107,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-  +model_params.model.eta: tag(log, interval(0.001, 1))
-  +model_params.model.lambda: tag(log, interval(0.001, 1))
-  +model_params.model.alpha: tag(log, interval(0.001, 1))
-  +model_params.model.subsample: interval(0.5, 1)
-  +model_params.model.min_child_weight: interval(1e-2, 100)
-  +model_params.model.max_depth: range(2, 16)
-  model_params.num_boost_round: range(100, 1000)
-  model_params.early_stopping_rounds: range(1, 10)
+  model.eta: tag(log, interval(0.001, 1))
+  model.lambda: tag(log, interval(0.001, 1))
+  model.alpha: tag(log, interval(0.001, 1))
+  model.subsample: interval(0.5, 1)
+  model.min_child_weight: interval(1e-2, 100)
+  model.max_depth: range(2, 16)
+  num_boost_round: range(100, 1000)
+  early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -19,6 +19,8 @@ dependencies = [
"scikit-learn", "hydra-optuna-sweeper", "hydra-joblib-launcher", "ml-mixins", "meds==0.3.3", "meds-transforms==0.0.7",
]

+[tool.setuptools_scm]

[project.scripts]
meds-tab-describe = "MEDS_tabular_automl.scripts.describe_codes:main"
meds-tab-tabularize-static = "MEDS_tabular_automl.scripts.tabularize_static:main"
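
The new empty `[tool.setuptools_scm]` table enables setuptools-scm, which derives the package version from git tags at build time. A quick way to preview the version it would assign (assuming setuptools-scm is installed and the checkout retains its git history):

```console
# Run from the repository root; prints the version derived from the latest git tag
python -m setuptools_scm
```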
8 changes: 4 additions & 4 deletions src/MEDS_tabular_automl/configs/default.yaml
@@ -1,13 +1,13 @@
-MEDS_cohort_dir: ???
-output_cohort_dir: ???
+input_dir: ???
+output_dir: ???
do_overwrite: False
seed: 1
tqdm: False
worker: 0
loguru_init: False

-log_dir: ${output_cohort_dir}/.logs/
-cache_dir: ${output_cohort_dir}/.cache
+log_dir: ${output_dir}/.logs/
+cache_dir: ${output_dir}/.cache

hydra:
verbose: False
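
Declaring `input_dir` and `output_dir` as `???` marks them mandatory: OmegaConf raises `MissingMandatoryValue` as soon as a script reads an unset `???` key, so a launch that forgets either argument now fails up front instead of falling back to a stale default. A hypothetical failing invocation:

```console
# Omitting the now-mandatory input_dir aborts with omegaconf's
# MissingMandatoryValue error the moment the config value is accessed.
meds-tab-describe output_dir="output_directory"
```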
3 changes: 1 addition & 2 deletions src/MEDS_tabular_automl/configs/describe_codes.yaml
@@ -2,8 +2,7 @@ defaults:
- default
- _self_

-input_dir: ${output_cohort_dir}/data
> **Owner:** Is `MEDS_cohort_dir` used for anything else? If not, do we need to specify it and `input_dir` separately? Can we just have one parameter? That would help avoid the confusion that arises depending on whether you are or aren't using a resharding stage (b/c when you are using a re-sharding stage, the raw `MEDS_cohort_dir` is only the input to that first resharding stage).

# Where to store output code frequency data
-output_filepath: ${output_cohort_dir}/metadata/codes.parquet
+output_filepath: ${output_dir}/metadata/codes.parquet

name: describe_codes
28 changes: 0 additions & 28 deletions src/MEDS_tabular_automl/configs/launch_autogluon.yaml

This file was deleted.

39 changes: 14 additions & 25 deletions src/MEDS_tabular_automl/configs/launch_model.yaml
@@ -1,39 +1,28 @@
defaults:
-  - _self_
  - default
  - tabularization: default
-  - model: xgboost # This can be changed to sgd_classifier or any other model
-  - imputer: default
-  - normalization: default
-  - override hydra/callbacks: evaluation_callback
+  - model_launcher: xgboost
  - override hydra/sweeper: optuna
  - override hydra/sweeper/sampler: tpe
+  - override hydra/callbacks: evaluation_callback
  - override hydra/launcher: joblib
+  - _self_

-task_name: task
+task_name: ???

-# Task cached data dir
-input_dir: ${output_cohort_dir}/${task_name}/task_cache
-# Directory with task labels
-input_label_dir: ${output_cohort_dir}/${task_name}/labels/
+# Location of task, split, and shard specific tabularized data
+input_tabularized_cache_dir: ${output_dir}/${task_name}/task_cache
+# Location of task, split, and shard specific label data
+input_label_cache_dir: ${output_dir}/${task_name}/labels
# Where to output the model and cached data
-model_saving:
-  model_dir: ${output_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
-  model_file_stem: model
-  model_file_extension: .json
-  delete_below_top_k: -1
-model_logging:
-  model_log_dir: ${model_saving.model_dir}/.logs/
-  performance_log_stem: performance
-  config_log_stem: config
+output_model_dir: ???

+delete_below_top_k: -1

name: launch_model

hydra:
verbose: False
job:
name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S}
sweep:
-    dir: ${model_log_dir}
+    dir: ${output_model_dir}/sweeps/{now:%Y-%m-%d-%H-%M-%S}/
subdir: "1"
run:
-    dir: ${model_log_dir}
+    dir: ${path.model_log_dir}
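
With `task_name` and `output_model_dir` now mandatory (`???`), a launch has to supply both explicitly; a minimal sketch using the `meds-tab-xgboost` entry point from this repo (paths hypothetical):

```console
meds-tab-xgboost \
    input_dir="path_to_data" \
    output_dir="output_directory" \
    task_name=$TASK \
    output_model_dir="model_output_directory"
```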
44 changes: 0 additions & 44 deletions src/MEDS_tabular_automl/configs/model/knn_classifier.yaml

This file was deleted.
