
Converting Esgpt caching to work for MEDS datasets #1

Merged · 75 commits · Jun 5, 2024
Changes from 61 commits
d5ff0df
added test case and initial testing and and code for shardingoutput t…
Oufattole May 25, 2024
4a486aa
added static feature pivoting
Oufattole May 26, 2024
19f0f4e
added docstrings and a smiple test case checking the number of subjec…
Oufattole May 26, 2024
fd1731f
Refactor scripts into separate modules for improved clarity:
Oufattole May 27, 2024
63b9ba6
fixed doctests and updated github workflow tests to use python 3.12
Oufattole May 27, 2024
cd067f8
Implement data processing for MEDS format to pivot tables into two in…
Oufattole May 27, 2024
c8ca3bb
Enhance data aggregation framework with dynamic window and aggregatio…
Oufattole May 27, 2024
7fdc37d
Update src/MEDS_tabular_automl/generate_summarized_reps.py
mmcdermott May 27, 2024
4dd3cad
Update src/MEDS_tabular_automl/generate_summarized_reps.py
mmcdermott May 27, 2024
720a533
Update src/MEDS_tabular_automl/generate_summarized_reps.py
mmcdermott May 27, 2024
548e29a
Added doctest and updated docstrings in identiy_columns.py. [WIP] add…
Oufattole May 27, 2024
d9ba7e7
Merge branch 'esgpt_caching' of github.com:mmcdermott/MEDS_Tabular_Au…
Oufattole May 27, 2024
f0b1cbb
working on xgboost
teyaberg May 28, 2024
ba954ef
current state
Oufattole May 28, 2024
df2750a
Removed tqdm, fixed deprecated groupbys, fixed doctest long-line issue.
mmcdermott May 28, 2024
4bbbc20
Fixed one of the summary doctests.
mmcdermott May 28, 2024
d39bf1a
updates based on formats... still many to dos
teyaberg May 28, 2024
41fe4b4
using sparse matrices for generating time series representations
Oufattole May 29, 2024
cb5f689
still working on sparse matrix to external memory xgboost
teyaberg May 29, 2024
8bc9a16
same problem
teyaberg May 29, 2024
1e27526
cleaned some testing
teyaberg May 29, 2024
97938a8
sped up the tabularize_ts script by about 30% by concatenating the sp…
Oufattole May 29, 2024
c28e6b2
got iterator working with csr_matrices for X and numpy arrays for y
teyaberg May 29, 2024
6f3b1ec
added support for sparse aggregations
Oufattole May 29, 2024
6753609
passing unit tests for sparse aggregations (only code/count and value…
Oufattole May 29, 2024
f125600
added significant speed improvements for rolling window aggregations
Oufattole May 29, 2024
2acc3bc
improved speed, by removing conversion from sparse scipy matrix to sp…
Oufattole May 29, 2024
eec05e2
takes about an hour to run through a shard. The speed gain is from me…
Oufattole May 30, 2024
bd9bdae
added scripts to the readme
Oufattole May 30, 2024
29c8c5f
save before breaking it
teyaberg May 30, 2024
4c7d3e7
added support for parallelism using mapper warp function. We cache fe…
Oufattole May 30, 2024
ba796e5
wip
teyaberg May 30, 2024
3678d30
automl
teyaberg May 30, 2024
82b3903
Merge branch 'esgpt_caching' into xgboost
Oufattole May 30, 2024
ffa0f3c
working on collect_in_memory
teyaberg May 31, 2024
c8f26ea
collect in memory fixed
teyaberg May 31, 2024
f6a3751
added hf_cohort scripts
May 31, 2024
2ec1860
Apply suggestions from code review
mmcdermott May 31, 2024
db18dc5
cleaning
teyaberg May 31, 2024
abba3d2
local WIP--changing to sparse matrix implementation
teyaberg May 31, 2024
77f296f
added merging of static and time series data
May 31, 2024
e8f26eb
Merge branch 'esgpt_caching' of github.com:mmcdermott/MEDS_Tabular_Au…
May 31, 2024
958906d
merging script runs, but the output is 50GB
Oufattole May 31, 2024
7668382
merging script works and is efficient
Oufattole May 31, 2024
b6b8d43
fixed bug with sparse matrix shape being too small for merging static…
Oufattole May 31, 2024
e6a88a7
changed to sparse format
teyaberg Jun 1, 2024
e8d64fd
added script for extracting tasks using aces
Oufattole Jun 1, 2024
5b2f7f7
merged xgboost code
Oufattole Jun 1, 2024
5c5dc8e
added dependencies
Oufattole Jun 1, 2024
357845e
Merge branch 'xgboost' into esgpt_caching
Oufattole Jun 1, 2024
d99e274
added support for loading cached labels and event indexes
Oufattole Jun 1, 2024
cadc603
updated readme
Oufattole Jun 1, 2024
285ccbf
size issues for loading sparse matrix
teyaberg Jun 1, 2024
795b532
push updates
teyaberg Jun 1, 2024
b9d057b
4x speed increase for tabularization to sparse matrix by caching wind…
Oufattole Jun 1, 2024
85f38b5
Merge branch 'xgboost' into esgpt_caching
Oufattole Jun 1, 2024
7ea3230
standardized file storage using file_name.py and updated from using n…
Oufattole Jun 2, 2024
23a2e3b
cleaned up file paths so we can load all aggregations selectively and…
Oufattole Jun 2, 2024
c225c47
fixed bug with codes that are only in the test and validation set (no…
Jun 2, 2024
cb21821
fixed bug with summarization script crashing for min and max value ag…
Jun 2, 2024
3a412a0
removed overwrite killing of jobs which causes errors in multirun
Jun 2, 2024
a4f1843
Xgboost is able to load all concatenated windows and aggregations. Fi…
Oufattole Jun 2, 2024
800ab7e
fixed timedelta overflow bug
Jun 2, 2024
820e194
Merge branch 'esgpt_caching' of github.com:mmcdermott/MEDS_Tabular_Au…
Jun 2, 2024
4b0637a
fixed bug with loading feature columns json for aces task script
Jun 2, 2024
127d04a
added memory profiling to hf_cohort e2e script
Oufattole Jun 2, 2024
23877ad
Merge branch 'esgpt_caching' of github.com:mmcdermott/MEDS_Tabular_Au…
Oufattole Jun 2, 2024
36f54a3
Made tests ignore the hf_cohort directory
mmcdermott Jun 2, 2024
81bf2d9
Pre-commit fixes
mmcdermott Jun 2, 2024
83c4eec
Resolving deprecation warnings
mmcdermott Jun 2, 2024
e7a85ba
Fixed test installation instructions.
mmcdermott Jun 2, 2024
35acb97
Merge branch 'esgpt_caching' into mmd_changes
mmcdermott Jun 2, 2024
bef63b6
Resolved one error (or, rather, shifted it) by making some things pro…
mmcdermott Jun 2, 2024
e9775e2
Shifted more test errors around, but the failures are deeper than exp…
mmcdermott Jun 2, 2024
c8f4144
Merge pull request #4 from mmcdermott/mmd_changes
mmcdermott Jun 2, 2024
4 changes: 2 additions & 2 deletions .github/workflows/tests.yaml
@@ -19,10 +19,10 @@ jobs:
       - name: Checkout
         uses: actions/checkout@v3

-      - name: Set up Python 3.11
+      - name: Set up Python 3.12
         uses: actions/setup-python@v3
         with:
-          python-version: "3.11"
+          python-version: "3.12"

       - name: Install packages
         run: |
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -38,6 +38,7 @@ repos:
     rev: v2.2.0
     hooks:
       - id: autoflake
+        args: [--in-place, --remove-all-unused-imports]

   # python upgrading syntax to newer version
   - repo: https://github.com/asottile/pyupgrade
111 changes: 86 additions & 25 deletions README.md
@@ -1,8 +1,10 @@
# Scalable tabularization and tabular feature usage utilities over generic MEDS datasets

This repository provides utilities and scripts to run limited automatic tabular ML pipelines for generic MEDS
datasets.

#### Q1: What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?

This is a common misconception. _Tabular_ data refers to data that can be organized in a consistent, logical
set of rows/columns such that the entirety of a "sample" or "instance" for modeling or analysis is contained
in a single row, and the set of columns possibly observed (there can be missingness) is consistent across all
@@ -15,28 +17,62 @@
or future windows in time to produce a single row per patient with a consistent,
(though there may still be missingness).

#### Q2: Why not other systems?

- [TemporAI](https://github.com/vanderschaarlab/temporai) is the most natural competitor, and already
  supports AutoML capabilities. However, TemporAI (as of now) does not support generic MEDS datasets, and it
  is not clear if their AutoML systems will scale to the size of datasets we need to support. But, further
  investigation is needed, and it may be the case that the best solution here is simply to write a custom
  data source for MEDS data within TemporAI and leverage their tools.

# Installation

Clone this repository and install the requirements by running `pip install .` in the root directory.

# Usage

This repository consists of two key pieces:

1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing
   patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) either backwards or
   forwards in time from a given index date. Naturally, only "look-back" windows should be used for
   future-event prediction tasks; however, the capability to summarize "look-ahead" windows is also useful
   for characterizing and describing the differences between patient populations statistically. (A minimal
   sketch of this windowed summarization follows below.)
2. Running basic AutoML pipelines over these tabular features to predict arbitrary binary classification
   downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced --
   what is more advanced is the efficient construction, storage, and loading of tabular features for the
   candidate AutoML models, enabling a far more extensive search over different featurization strategies.

Comment on lines +32 to +44 — Contributor:

Clarify the usage instructions to ensure they are easy to follow.

- See `tests/test_tabularize_integration.py` for an example of the end-to-end pipeline being run on synthetic data.
+ For an example of the end-to-end pipeline execution on synthetic data, refer to `tests/test_tabularize_integration.py`.

Committable suggestion was skipped due to low confidence.
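The windowed summarization in key piece (1) can be pictured with a small polars sketch. This is a minimal illustration, not the repository's implementation: it assumes a MEDS-style long-format frame with `patient_id`, `timestamp`, and `numerical_value` columns, and the `rolling`/`group_by` API of polars >= 1.0.

```python
import datetime

import polars as pl

# Toy MEDS-style event data, sorted by timestamp within each patient.
events = pl.DataFrame(
    {
        "patient_id": [1, 1, 1, 2],
        "timestamp": [
            datetime.datetime(2024, 1, 1),
            datetime.datetime(2024, 1, 5),
            datetime.datetime(2024, 1, 20),
            datetime.datetime(2024, 2, 1),
        ],
        "numerical_value": [1.0, 2.0, 4.0, 8.0],
    }
)

# One output row per event, aggregating that event's trailing 7-day
# ("look-back") window per patient -- the shape of representation this
# repository builds at scale, per shard and per window size.
summary = events.rolling(
    index_column="timestamp", period="7d", group_by="patient_id"
).agg(
    pl.len().alias("code/count"),
    pl.col("numerical_value").sum().alias("value/sum"),
)
print(summary)
```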

### Scripts and Examples

See `tests/test_tabularize_integration.py` for an example of the end-to-end pipeline being run on synthetic data. This
script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

#### Core Scripts:

1. `scripts/identify_columns.py` loads all training shards to identify which feature columns
   to generate tabular data for.
2. `scripts/tabularize_static.py` iterates through shards and generates tabular vectors for
   each patient. There is a single row per patient for each shard.
3. `scripts/summarize_over_windows.py` iterates, for each shard, through window sizes and aggregations, and
   horizontally concatenates the outputs to generate the final tabular representations at every event time for
   every patient.
4. `scripts/tabularize_merge` aligns the time-series window aggregations (generated in the previous step) with
   the static tabular vectors and caches them for training.
5. `scripts/hf_cohort/aces_task_extraction.py` generates the task labels and caches them with the event_id
   indexes which align them with the nearest prior event in the tabular data.
6. `scripts/xgboost_sweep.py` tunes XGBoost models, iterating through the labels and corresponding tabular
   data (see the external-memory iterator sketch below).

We run this on an example dataset using the following bash scripts in sequence:

```bash
bash hf_cohort_shard.sh # processes the dataset into MEDS format
bash hf_cohort_e2e.sh # performs steps 1-4 above
bash hf_cohort/aces_task.sh # generates labels (step 5)
bash xgboost.sh # trains XGBoost (step 6)
```
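Because the tabularized features can exceed memory, the XGBoost step streams cached shards rather than loading everything at once (see the `iterator.keep_data_in_memory` flag in `configs/xgboost_sweep.yaml`). The sketch below shows the general external-memory pattern with `xgboost.DataIter`; the shard file layout and the `.npz` field names are illustrative assumptions, not the repository's exact format.

```python
from pathlib import Path

import numpy as np
import scipy.sparse as sp
import xgboost as xgb


class ShardIter(xgb.DataIter):
    """Feed one cached sparse shard at a time so the full dataset never sits in RAM."""

    def __init__(self, shard_dir: Path):
        self._shards = sorted(shard_dir.glob("*.npz"))  # hypothetical layout
        self._it = 0
        super().__init__(cache_prefix=str(shard_dir / "cache"))

    def next(self, input_data) -> int:
        if self._it == len(self._shards):
            return 0  # signal end of one pass over the shards
        with np.load(self._shards[self._it]) as f:
            X = sp.csr_matrix(
                (f["data"], f["indices"], f["indptr"]), shape=tuple(f["shape"])
            )
            y = f["labels"]
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self) -> None:
        self._it = 0


# Usage: build an external-memory DMatrix and train as usual.
# dtrain = xgb.DMatrix(ShardIter(Path("tabularize/train")))
# booster = xgb.train({"tree_method": "hist", "objective": "reg:squarederror"}, dtrain)
```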

## Feature Construction, Storage, and Loading

Tabularization of a (raw) MEDS dataset is done by running the `scripts/data/tabularize.py` script. This script
must inherently do a base level of preprocessing over the MEDS data, then will construct a sharded tabular
representation that respects the overall sharding of the raw data. This script uses [Hydra](https://hydra.cc/)
@@ -45,14 +81,39 @@
to manage configuration, and the configuration file is located at `configs/tabularize.yaml`.
## AutoML Pipelines

# TODOs

1. Leverage the "event bound aggregation" capabilities of [ESGPT Task
   Select](https://github.com/justin13601/ESGPTTaskQuerying/) to construct tabular summary features for
   event-bound historical windows (e.g., until the prior admission, until the last diagnosis of some type,
   etc.).
2. Support more feature aggregation functions.
3. Probably rename this repository, as the focus is really more on the tabularization and feature usage
   utilities than on the AutoML pipelines themselves.
4. Import, rather than reimplement, the mapper utilities from the MEDS preprocessing repository.
5. Investigate the feasibility of using TemporAI for this task.
6. Consider splitting the feature construction and AutoML pipeline parts of this repository into separate
   repositories.

Comment on lines +84 to +96 — Contributor:

Address the TODOs to ensure they are actively tracked and prioritized.

Would you like me to help create GitHub issues for these TODOs to ensure they are not overlooked?

# YAML Configuration File

- `MEDS_cohort_dir`: directory of MEDS format dataset that is ingested.
- `tabularized_data_dir`: output directory of tabularized data.
- `min_code_inclusion_frequency`: The base feature inclusion frequency that should be used to dictate
  what features can be included in the flat representation. It can either be a float, in which
  case it applies across all measurements, or `None`, in which case no filtering is applied, or
  a dictionary from measurement type to a float dictating a per-measurement-type inclusion
  cutoff (a sketch of this filter follows this list).
- `window_sizes`: Beyond writing out a raw, per-event flattened representation, the dataset also has
  the capability to summarize these flattened representations over the historical windows
  specified in this argument. These are strings specifying time deltas (e.g., `1d`, `30d`, `365d`,
  or `full`). Each window size will be summarized to a separate directory, and will share the same
  subject file split as is used in the raw representation files.
- `codes`: A list of codes to include in the flat representation. If `None`, all codes will be included
  in the flat representation.
- `aggs`: A list of aggregations to apply to the raw representation. Must have length greater than 0.
- `n_patients_per_sub_shard`: The number of subjects that should be included in each output file.
  Lowering this number increases the number of files written, making the process of creating and
  leveraging these files slower but more memory efficient.
- `do_overwrite`: If `True`, this function will overwrite the data already stored in the target save
  directory.
- `seed`: The seed to use for random number generation.
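The `min_code_inclusion_frequency` option above amounts to a frequency cut over the training shards. A hedged sketch of what it does; the function name is illustrative, and the column names follow the MEDS convention used elsewhere in this PR:

```python
import polars as pl


def filter_codes(df: pl.LazyFrame, min_freq: int) -> pl.LazyFrame:
    """Keep only events whose code appears at least `min_freq` times."""
    keep = (
        df.group_by("code")
        .agg(pl.len().alias("n"))
        .filter(pl.col("n") >= min_freq)
        .select("code")
    )
    return df.join(keep, on="code", how="inner")
```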
18 changes: 8 additions & 10 deletions configs/tabularize.yaml
@@ -7,29 +7,27 @@
min_code_inclusion_frequency: ???
window_sizes: ???
codes: null
aggs:
- "static/present"
- "static/first"
- "code/count"
- "code/time_since_last"
- "code/time_since_first"
- "value/count"
- "value/sum"
- "value/sum_sqd"
- "value/min"
- "value/time_since_min"
- "value/max"
- "value/time_since_max"
- "value/last"
- "value/slope"
- "value/intercept"
- "value/residual/sum"
- "value/residual/sum_sqd"

dynamic_threshold: 0.01
numerical_value_threshold: 0.1

# Sharding
n_patients_per_sub_shard: null

# Misc
do_overwrite: False
do_update: True
seed: 1
tqdm: False
worker: 1
test: False

# Hydra
hydra:
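The aggregation names in the configuration above map onto simple reductions over a patient's values within a window. A hedged sketch of that correspondence as polars expressions; the mapping is illustrative, not the repository's exact implementation:

```python
import polars as pl

# Illustrative mapping from config agg names to window-level reductions.
AGG_EXPRS = {
    "code/count": pl.len(),
    "value/count": pl.col("numerical_value").drop_nulls().len(),
    "value/sum": pl.col("numerical_value").sum(),
    "value/sum_sqd": (pl.col("numerical_value") ** 2).sum(),
    "value/min": pl.col("numerical_value").min(),
    "value/max": pl.col("numerical_value").max(),
}
```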
60 changes: 60 additions & 0 deletions configs/xgboost_sweep.yaml
@@ -0,0 +1,60 @@
# Raw data
MEDS_cohort_dir: ???
tabularized_data_dir: ???
model_dir: ${tabularized_data_dir}/model

# Pre-processing
min_code_inclusion_frequency: 1
window_sizes: [1d]
codes: null
aggs:
- "code/count"
- "value/sum"

dynamic_threshold: 0.01
numerical_value_threshold: 0.1

# Sharding
n_patients_per_sub_shard: null

# Misc
do_overwrite: False
do_update: True
seed: 1
tqdm: True

model:
  booster: gbtree
  device: cpu
  tree_method: hist
  objective: reg:squarederror

iterator:
  keep_data_in_memory: False

# Hydra settings for sweep
defaults:
  - override hydra/sweeper: optuna
  - override hydra/sweeper/sampler: tpe

hydra:
  verbose: False
  sweep:
    dir: ${tabularized_data_dir}/.logs/etl/${now:%Y-%m-%d_%H-%M-%S}
  run:
    dir: ${tabularized_data_dir}/.logs/etl/${now:%Y-%m-%d_%H-%M-%S}

  # Optuna Sweeper
  sweeper:
    sampler:
      seed: 1
    storage: null
    study_name: tabularize_study_${now:%Y-%m-%d_%H-%M-%S}
    direction: minimize
    n_trials: 10

    # Define search space for Optuna
    params:
      window_sizes: choice([30d, 365d, full], [30d, full], [30d])
      # iterator.keep_static_data_in_memory: choice([True], [False])
      # iterator.keep_data_in_memory: choice([True], [False])
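The `model` block in this sweep config is shaped like a plain XGBoost parameter dict, so a natural way to consume it is to materialize it from the OmegaConf config and hand it to `xgb.train`. A minimal sketch, assuming the config is loaded outside of a Hydra entry point (inside `scripts/xgboost_sweep.py`, Hydra would inject `cfg` directly):

```python
import xgboost as xgb
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/xgboost_sweep.yaml")
params = OmegaConf.to_container(cfg.model, resolve=True)
# params == {"booster": "gbtree", "device": "cpu",
#            "tree_method": "hist", "objective": "reg:squarederror"}

# booster = xgb.train(params, dtrain)  # dtrain built from the cached shards
```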
13 changes: 13 additions & 0 deletions hf_cohort/aces_task.sh
@@ -0,0 +1,13 @@
#!/usr/bin/env bash

MEDS_DIR=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort
OUTPUT_DIR=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/tabularize
# N_PARALLEL_WORKERS="$1"
WINDOW_SIZES="window_sizes=[1d]"
AGGS="aggs=[code/count,value/sum]"

python /home/nassim/projects/MEDS_Tabular_AutoML/hf_cohort/aces_task_extraction.py \
MEDS_cohort_dir=$MEDS_DIR \
tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 do_overwrite=False \
"$WINDOW_SIZES" "$AGGS"
51 changes: 51 additions & 0 deletions hf_cohort/aces_task_extraction.py
@@ -0,0 +1,51 @@
"""
Setup Conda environment as described here: https://github.com/justin13601/ACES
"""
from pathlib import Path

import hydra
import polars as pl
from aces import config, predicates, query
from tqdm import tqdm


@hydra.main(version_base=None, config_path="../configs", config_name="tabularize")
Comment on the `@hydra.main` decorator — Contributor:

Ensure the `config_path` and `config_name` are correctly set for the Hydra configuration.

Codebase verification: The Hydra configuration file `configs/tabularize.yaml` does exist in the repository, but the script's current working directory is `/home/jailuser/git`, which might not align with the relative path used in the Hydra decorator. This discrepancy can cause the configuration file to be inaccessible.

• Ensure the script is executed from the correct directory or adjust the relative path accordingly.

def main(cfg):
    # create task configuration object
    task_cfg = config.TaskExtractorConfig.load(config_path="hf_cohort/task.yaml")

    # setup directories
    med_dir = Path(cfg.tabularized_data_dir)

    # location of MEDS format Data
    cohort_dir = med_dir.parent / "final_cohort"
    # output directory for tables with event_ids and labels
    output_dir = med_dir / "task"

    shard_fps = list(cohort_dir.glob("*/*.parquet"))

    for in_fp in tqdm(shard_fps):
        out_fp = output_dir / "/".join(in_fp.parts[-2:])
        out_fp.parent.mkdir(parents=True, exist_ok=True)
        # one of the following
        predicates_df = predicates.generate_predicates_df(task_cfg, in_fp, "meds")

        # execute query
        df_result = query.query(task_cfg, predicates_df)
        label_df = (
            df_result.select(pl.col(["subject_id", "trigger", "label"]))
            .rename({"trigger": "timestamp", "subject_id": "patient_id"})
            .sort(by=["patient_id", "timestamp"])
        )
        data_df = pl.scan_parquet(in_fp)
        data_df = data_df.unique(subset=["patient_id", "timestamp"]).sort(by=["patient_id", "timestamp"])
        data_df = data_df.with_row_index("event_id")
        data_df = data_df.drop(["code", "numerical_value"])
        output_df = label_df.lazy().join_asof(other=data_df, by="patient_id", on="timestamp")

        # store it
        output_df.collect().write_parquet(out_fp)


if __name__ == "__main__":
    main()
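The `join_asof` above is what attaches each label to the nearest prior event (the default `strategy="backward"`). A toy illustration of those semantics, not taken from the PR:

```python
import datetime

import polars as pl

labels = pl.DataFrame(
    {
        "patient_id": [1, 1],
        "timestamp": [datetime.datetime(2024, 1, 6), datetime.datetime(2024, 1, 21)],
        "label": [0, 1],
    }
)
events = pl.DataFrame(
    {
        "patient_id": [1, 1, 1],
        "timestamp": [
            datetime.datetime(2024, 1, 1),
            datetime.datetime(2024, 1, 5),
            datetime.datetime(2024, 1, 20),
        ],
        "event_id": [0, 1, 2],
    }
)

# Each label picks up the event_id of the closest event at or before it.
print(labels.join_asof(events, by="patient_id", on="timestamp"))
# event_id column: [1, 2]
```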
21 changes: 21 additions & 0 deletions hf_cohort/config.yaml
@@ -0,0 +1,21 @@
# Path to the task configuration file
config_path: task.yaml

# Raw Data
data:
  # Path to the data file or directory
  path: /storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort/train/0.parquet
Comment on the `data.path` entry — Contributor:

Ensure the path is correct and accessible.

Codebase verification: The specified path in the configuration file is not accessible.

• Ensure the path /storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort/train/0.parquet is correct and the file exists.

# Data standard, one of (csv, meds, esgpt)
standard: meds

# Output Directory (saves as .parquet file)
output_dir: results/

# Hydra
hydra:
  job:
    name: ACES_${now:%Y-%m-%d_%H-%M-%S}
  run:
    dir: ${ACES_dir}/.logs/${hydra.job.name}
# aces-cli --config-dir='./' --config-name='config.yaml'
32 changes: 32 additions & 0 deletions hf_cohort/hf_cohort_e2e.sh
@@ -0,0 +1,32 @@
#!/usr/bin/env bash

MEDS_DIR=/storage/shared/meds_tabular_ml/ebcl_dataset/processed
OUTPUT_DIR=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/tabularize
N_PARALLEL_WORKERS="$1" # used by the summarize_over_windows multirun below
WINDOW_SIZES="window_sizes=[1d]"
AGGS="aggs=[code/count,value/sum]"
# WINDOW_SIZES="window_sizes=[1d,7d,30d,365d,full]"
# AGGS="aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"

echo "Running identify_columns.py: Caching feature names and frequencies."
rm -rf $OUTPUT_DIR
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
MEDS_cohort_dir=$MEDS_DIR \
tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"
Comment on lines +13 to +18 — Contributor:

Consider adding error handling for the rm and python commands.

The script lacks error handling for critical operations like removing directories and running Python scripts. Adding error checks will improve the robustness of the script.

Suggested change:

echo "Running identify_columns.py: Caching feature names and frequencies."
if ! rm -rf $OUTPUT_DIR; then
    echo "Failed to remove $OUTPUT_DIR"
    exit 1
fi
if ! POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
    MEDS_cohort_dir=$MEDS_DIR \
    tabularized_data_dir=$OUTPUT_DIR \
    min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"; then
    echo "Python script failed"
    exit 1
fi


echo "Running tabularize_static.py: tabularizing static data"
POLARS_MAX_THREADS=32 python scripts/tabularize_static.py \
MEDS_cohort_dir=$MEDS_DIR \
tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"

echo "Running summarize_over_windows.py with $N_PARALLEL_WORKERS workers in parallel"
POLARS_MAX_THREADS=1 python scripts/summarize_over_windows.py \
--multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir=$MEDS_DIR \
tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 do_overwrite=False \
"$WINDOW_SIZES" "$AGGS"