# Converting ESGPT caching to work for MEDS datasets #1
**Modified file** (the repository README):

@@ -1,8 +1,10 @@
# Scalable tabularization and tabular feature usage utilities over generic MEDS datasets

This repository provides utilities and scripts to run limited automatic tabular ML pipelines for generic MEDS
datasets.

#### Q1: What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
This is a common misconception. _Tabular_ data refers to data that can be organized in a consistent, logical
set of rows/columns such that the entirety of a "sample" or "instance" for modeling or analysis is contained
in a single row, and the set of columns possibly observed (there can be missingness) is consistent across all

@@ -15,28 +17,60 @@

…or future windows in time to produce a single row per patient with a consistent, logical set of columns
(though there may still be missingness).
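To make the distinction concrete, here is a minimal sketch (not this repository's API) of turning a toy
longitudinal, MEDS-style event table into a tabular one-row-per-patient summary over a fixed look-back
window. The codes, column names, and window choice are illustrative assumptions.

```python
from datetime import datetime, timedelta

import polars as pl

# Toy longitudinal events: many rows per patient, one per (time, code) observation.
events = pl.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "timestamp": [
        datetime(2020, 1, 1), datetime(2020, 6, 1),
        datetime(2020, 12, 1), datetime(2020, 3, 1),
    ],
    "code": ["DX_A", "LAB_B", "DX_A", "LAB_B"],
    "numerical_value": [None, 1.2, None, 3.4],
})

index_date = datetime(2021, 1, 1)
lookback = timedelta(days=365)

# Tabular summary: exactly one row per patient, with a fixed column set
# (missingness is still possible, e.g. a patient with no LAB_B values).
tabular = (
    events
    .filter(pl.col("timestamp").is_between(index_date - lookback, index_date))
    .group_by("patient_id")
    .agg(
        (pl.col("code") == "DX_A").sum().alias("DX_A/code/count"),
        pl.col("numerical_value").filter(pl.col("code") == "LAB_B").sum().alias("LAB_B/value/sum"),
    )
)
print(tabular)
```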
#### Q2: Why not other systems?

- [TemporAI](https://github.com/vanderschaarlab/temporai) is the most natural competitor, and already
  supports AutoML capabilities. However, TemporAI (as of now) does not support generic MEDS datasets, and it
  is not clear whether their AutoML systems will scale to the size of datasets we need to support. Further
  investigation is needed, and the best solution here may simply be to write a custom data source for MEDS
  data within TemporAI and leverage their tools.
# Installation

Clone this repository and install the requirements by running `pip install .` in the root directory.

# Usage
This repository consists of two key pieces:

1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing
   patient records in MEDS over arbitrary time windows (e.g., 1 year, 6 months, etc.), either backwards or
   forwards in time from a given index date. Naturally, only "look-back" windows should be used for
   future-event prediction tasks; however, the capability to summarize "look-ahead" windows is also useful
   for statistically characterizing and describing differences between patient populations.
2. Running basic AutoML pipelines over these tabular features to predict arbitrary binary classification
   downstream tasks defined over these datasets (a sketch follows this list). The "AutoML" part of this is
   not particularly advanced; what is more advanced is the efficient construction, storage, and loading of
   tabular features for the candidate AutoML models, enabling a far more extensive search over different
   featurization strategies.
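For the second piece, the downstream model itself can be as simple as a gradient-boosted tree fit on the
loaded feature matrix. The sketch below is purely illustrative: the random sparse feature matrix, labels,
and choice of XGBoost are assumptions standing in for the real tabularized features and task labels.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = sparse_random(1_000, 200, density=0.05, format="csr", random_state=0)  # stand-in feature matrix
y = rng.integers(0, 2, size=1_000)  # stand-in binary task labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```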
### Scripts and Examples

See `tests/test_tabularize_integration.py` for an example of the end-to-end pipeline being run on synthetic
data. This script is a functional test that is also run with `pytest` to verify the correctness of the
algorithm.

#### Core Scripts

1. `scripts/tabularize/identify_columns.py` loads all training shards to identify which feature columns
   to generate tabular data for (a conceptual sketch of this step follows the list).
```bash
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
    MEDS_cohort_dir=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort \
    tabularized_data_dir=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/tabularize \
    min_code_inclusion_frequency=1 "window_sizes=[1d, 7d, full]" do_overwrite=True
```
2. `scripts/tabularize/tabularize_static.py` iterates through shards and generates tabular vectors for
   each patient, with a single row per patient for each shard.

```bash
POLARS_MAX_THREADS=32 python scripts/tabularize_static.py \
    MEDS_cohort_dir=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort \
    tabularized_data_dir=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/tabularize \
    min_code_inclusion_frequency=1 "window_sizes=[1d, 7d, full]" do_overwrite=True
```
3. `scripts/tabularize/summarize_over_windows.py` iterates, for each shard, through window sizes and
   aggregations, then horizontally concatenates the outputs to generate the final tabular representations
   at every event time for every patient (a sketch of this computation also follows the list).

```bash
POLARS_MAX_THREADS=1 python scripts/summarize_over_windows.py \
    MEDS_cohort_dir=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort \
    tabularized_data_dir=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/tabularize \
    min_code_inclusion_frequency=1 "window_sizes=[1d, 7d, full]" do_overwrite=True
```
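Conceptually, the column-identification step amounts to counting code frequencies over the training shards
and keeping codes that clear the inclusion threshold. The shard layout (`train/*.parquet`), paths, and
column names below are assumptions for illustration, not this repository's exact implementation.

```python
from pathlib import Path

import polars as pl

MEDS_cohort_dir = Path("/path/to/final_cohort")  # hypothetical path
min_code_inclusion_frequency = 1

# Lazily scan every training shard, then count occurrences of each code.
shards = [
    pl.scan_parquet(fp).select("code")
    for fp in sorted((MEDS_cohort_dir / "train").glob("*.parquet"))
]
code_counts = pl.concat(shards).group_by("code").agg(pl.len().alias("count")).collect()

# Keep only codes frequent enough to become feature columns.
feature_columns = sorted(
    code_counts.filter(pl.col("count") >= min_code_inclusion_frequency)["code"].to_list()
)
```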
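And for the windowed summarization, the underlying computation is a per-patient rolling aggregation with one
output row per event time. The sketch below assumes polars' `rolling` context and reuses the `1d` window
with the `code/count` and `value/sum` aggregations from the examples above; the real script's API usage and
output naming may differ.

```python
from datetime import datetime

import polars as pl

events = pl.DataFrame({
    "patient_id": [1, 1, 1],
    "timestamp": [datetime(2020, 1, 1), datetime(2020, 1, 1, 12), datetime(2020, 1, 3)],
    "code": ["DX_A", "LAB_B", "DX_A"],
    "numerical_value": [None, 1.2, None],
}).sort("patient_id", "timestamp")

# One output row per event: aggregates over the trailing 1-day window ending at that event.
summary_1d = events.rolling(index_column="timestamp", period="1d", group_by="patient_id").agg(
    pl.len().alias("1d/code/count"),
    pl.col("numerical_value").sum().alias("1d/value/sum"),
)

# Repeating this for each window size and horizontally concatenating the results
# yields the final per-event tabular representation.
print(summary_1d)
```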
## Feature Construction, Storage, and Loading

Tabularization of a (raw) MEDS dataset is done by running the `scripts/data/tabularize.py` script. This script
must inherently do a base level of preprocessing over the MEDS data, then will construct a sharded tabular
representation that respects the overall sharding of the raw data. This script uses [Hydra](https://hydra.cc/)

@@ -45,14 +79,39 @@

…to manage configuration, and the configuration file is located at `configs/tabul`…
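For reference, a Hydra-managed script of this shape typically has an entry point like the sketch below. The
`config_path`/`config_name` values are assumptions inferred from the truncated path above, and the field
accesses mirror the configuration keys documented later in this README.

```python
import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="../configs", config_name="tabularize")
def main(cfg: DictConfig) -> None:
    # Hydra loads the YAML config and applies CLI overrides such as
    # `MEDS_cohort_dir=... "window_sizes=[1d, 7d, full]" do_overwrite=True`.
    print(cfg.MEDS_cohort_dir, cfg.tabularized_data_dir, cfg.window_sizes)


if __name__ == "__main__":
    main()
```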
## AutoML Pipelines

# TODOs

1. Leverage the "event bound aggregation" capabilities of [ESGPT Task
   Select](https://github.com/justin13601/ESGPTTaskQuerying/) to construct tabular summary features for
   event-bound historical windows (e.g., until the prior admission, until the last diagnosis of some type,
   etc.).
2. Support more feature aggregation functions.
3. Probably rename this repository, as the focus is really more on the tabularization and feature usage
   utilities than on the AutoML pipelines themselves.
4. Import, rather than reimplement, the mapper utilities from the MEDS preprocessing repository.
5. Investigate the feasibility of using TemporAI for this task.
6. Consider splitting the feature construction and AutoML pipeline parts of this repository into separate
   repositories.

> **Review comment** (on lines +84 to +96): Address the TODOs to ensure they are actively tracked and
> prioritized. Would you like me to help create GitHub issues for these TODOs to ensure they are not
> overlooked?
# YAML Configuration File

- `MEDS_cohort_dir`: directory of the MEDS-format dataset that is ingested.
- `tabularized_data_dir`: output directory for the tabularized data.
- `min_code_inclusion_frequency`: the base feature inclusion frequency that should be used to dictate
  what features can be included in the flat representation. It can either be a float, in which
  case it applies across all measurements, or `None`, in which case no filtering is applied, or
  a dictionary from measurement type to a float dictating a per-measurement-type inclusion
  cutoff.
- `window_sizes`: beyond writing out a raw, per-event flattened representation, the dataset also has
  the capability to summarize these flattened representations over the historical windows
  specified in this argument. These are strings specifying time deltas (e.g., `1d`, `7d`, `full`). Each
  window size will be summarized to a separate directory, and will share the same
  subject file split as is used in the raw representation files.
- `codes`: a list of codes to include in the flat representation. If `None`, all codes will be included
  in the flat representation.
- `aggs`: a list of aggregations to apply to the raw representation. Must have length greater than 0.
- `n_patients_per_sub_shard`: the number of subjects that should be included in each output file.
  Lowering this number increases the number of files written, making the process of creating and
  leveraging these files slower but more memory efficient.
- `do_overwrite`: if `True`, this function will overwrite the data already stored in the target save
  directory.
- `seed`: the seed to use for random number generation.
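Putting these fields together, a plausible configuration looks like the following sketch, built here with
OmegaConf. The concrete values are illustrative (the shard size and aggregation names are taken from the
example scripts below), and the one documented constraint on `aggs` is checked explicitly.

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "MEDS_cohort_dir": "/path/to/final_cohort",   # illustrative paths
    "tabularized_data_dir": "/path/to/tabularize",
    "min_code_inclusion_frequency": 1,            # float, None, or per-measurement-type dict
    "window_sizes": ["1d", "7d", "full"],
    "codes": None,                                # None => include all codes
    "aggs": ["code/count", "value/sum"],
    "n_patients_per_sub_shard": 2500,
    "do_overwrite": False,
    "seed": 1,
})

# The README requires `aggs` to be non-empty.
assert len(cfg.aggs) > 0, "`aggs` must have length greater than 0."
```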
> **Review comment:** Eventually we'll want to pull these MGH HF dataset-specific files out and have a
> MIMIC example or something here instead, but for now this is fine.
---

**New file** (`@@ -0,0 +1,39 @@`): a shell script that runs the full tabularization pipeline on the EBCL dataset.
```bash
#!/usr/bin/env bash

# NOTE: These dataset paths are hard-coded for now; see the review comments below.
MEDS_DIR=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/final_cohort
OUTPUT_DIR=/storage/shared/meds_tabular_ml/ebcl_dataset/processed/tabularize
N_PARALLEL_WORKERS="$1"
WINDOW_SIZES="window_sizes=[1d]"
AGGS="aggs=[code/count,value/sum]"
# WINDOW_SIZES="window_sizes=[1d,7d,30d,365d,full]"
# AGGS="aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"

echo "Running identify_columns.py: Caching feature names and frequencies."
rm -rf "$OUTPUT_DIR"
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
    MEDS_cohort_dir=$MEDS_DIR \
    tabularized_data_dir=$OUTPUT_DIR \
    min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"

echo "Running tabularize_static.py: tabularizing static data"
POLARS_MAX_THREADS=32 python scripts/tabularize_static.py \
    MEDS_cohort_dir=$MEDS_DIR \
    tabularized_data_dir=$OUTPUT_DIR \
    min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"

# Parallel variant (currently disabled); see the review comment below.
# echo "Running summarize_over_windows.py with $N_PARALLEL_WORKERS workers in parallel"
# POLARS_MAX_THREADS=1 python scripts/summarize_over_windows.py \
#     --multirun \
#     worker="range(0,$N_PARALLEL_WORKERS)" \
#     hydra/launcher=joblib \
#     MEDS_cohort_dir=$MEDS_DIR \
#     tabularized_data_dir=$OUTPUT_DIR \
#     min_code_inclusion_frequency=1 do_overwrite=False \
#     "$WINDOW_SIZES" "$AGGS"

echo "Running summarize_over_windows.py"
POLARS_MAX_THREADS=1 python scripts/summarize_over_windows.py \
    MEDS_cohort_dir=$MEDS_DIR \
    tabularized_data_dir=$OUTPUT_DIR \
    min_code_inclusion_frequency=1 do_overwrite=False \
    "$WINDOW_SIZES" "$AGGS"
```

> **Review comment** (on the `MEDS_DIR` line): Eventually, we'll want to make these args in this script
> (for the general release), but for now this is fine.

> **Review comment** (on the commented-out parallel block): I'm assuming this will eventually either be
> deleted or uncommented and replace the serial block.
> **Review comment:** Ditto the MIMIC example comment, and we'll want to dedupe the two .sh scripts.
---

**New file** (`@@ -0,0 +1,42 @@`): a shell script that runs the MEDS extraction pipeline (event sharding, patient splitting, event conversion, and merging) on the EBCL dataset.
```bash
#!/usr/bin/env bash
OUTPUT_DIR=/data/storage/shared/meds_tabular_ml/ebcl_dataset/processed
PATIENTS_PER_SHARD="2500"
CHUNKSIZE="200_000_000"

rm -rf "$OUTPUT_DIR"

# Each stage below shares the same arguments; the split fractions work out to
# 2/3 train, 1/6 tuning, and 1/6 held-out.
echo "Running shard_events.py"
POLARS_MAX_THREADS=32 python /home/nassim/projects/MEDS_polars_functions/scripts/extraction/shard_events.py \
    raw_cohort_dir=/data/storage/shared/meds_tabular_ml/ebcl_dataset \
    MEDS_cohort_dir=$OUTPUT_DIR \
    event_conversion_config_fp=/data/storage/shared/meds_tabular_ml/ebcl_dataset/cohort.yaml \
    split_fracs.train=0.6666666666666666 split_fracs.tuning=0.16666666666666666 \
    split_fracs.held_out=0.16666666666666666 row_chunksize=$CHUNKSIZE \
    n_patients_per_shard=$PATIENTS_PER_SHARD hydra.verbose=True

echo "Running split_and_shard_patients.py"
POLARS_MAX_THREADS=32 python /home/nassim/projects/MEDS_polars_functions/scripts/extraction/split_and_shard_patients.py \
    raw_cohort_dir=/data/storage/shared/meds_tabular_ml/ebcl_dataset \
    MEDS_cohort_dir=$OUTPUT_DIR \
    event_conversion_config_fp=/data/storage/shared/meds_tabular_ml/ebcl_dataset/cohort.yaml \
    split_fracs.train=0.6666666666666666 split_fracs.tuning=0.16666666666666666 \
    split_fracs.held_out=0.16666666666666666 row_chunksize=$CHUNKSIZE \
    n_patients_per_shard=$PATIENTS_PER_SHARD hydra.verbose=True

echo "Running convert_to_sharded_events.py"
POLARS_MAX_THREADS=32 python /home/nassim/projects/MEDS_polars_functions/scripts/extraction/convert_to_sharded_events.py \
    raw_cohort_dir=/data/storage/shared/meds_tabular_ml/ebcl_dataset \
    MEDS_cohort_dir=$OUTPUT_DIR \
    event_conversion_config_fp=/data/storage/shared/meds_tabular_ml/ebcl_dataset/cohort.yaml \
    split_fracs.train=0.6666666666666666 split_fracs.tuning=0.16666666666666666 \
    split_fracs.held_out=0.16666666666666666 row_chunksize=$CHUNKSIZE \
    n_patients_per_shard=$PATIENTS_PER_SHARD hydra.verbose=True

echo "Running merge_to_MEDS_cohort.py"
POLARS_MAX_THREADS=32 python /home/nassim/projects/MEDS_polars_functions/scripts/extraction/merge_to_MEDS_cohort.py \
    raw_cohort_dir=/data/storage/shared/meds_tabular_ml/ebcl_dataset \
    MEDS_cohort_dir=$OUTPUT_DIR \
    event_conversion_config_fp=/data/storage/shared/meds_tabular_ml/ebcl_dataset/cohort.yaml \
    split_fracs.train=0.6666666666666666 split_fracs.tuning=0.16666666666666666 \
    split_fracs.held_out=0.16666666666666666 row_chunksize=$CHUNKSIZE \
    n_patients_per_shard=$PATIENTS_PER_SHARD hydra.verbose=True
```
> **Review comment:** Clarify the usage instructions to ensure they are easy to follow.