Commit 8b7859c

Merge branch '55_MEDS_v03' into 59_add_meds_dependency
2 parents 477e940 + 490308a commit 8b7859c

19 files changed: +158 -209 lines changed

README.md (+23 -23)
@@ -22,7 +22,7 @@ ______________________________________________________________________
2222

2323
This repository consists of two key pieces:
2424

25-
1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backwards in time from a given index date.
25+
1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backward in time from a given index date.
2626
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.
2727

2828
## Quick Start
@@ -45,8 +45,8 @@ pip install .
4545

4646
## Scripts and Examples
4747

48-
For an end to end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
49-
For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
48+
For an end-to-end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
49+
For an end-to-end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
5050

5151
See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
5252

@@ -74,7 +74,7 @@ By following these steps, you can seamlessly transform your dataset, define nece
7474

7575
## Core CLI Scripts Overview
7676

77-
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
77+
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code types. It differentiates codes into the following categories:
7878

7979
- time-series codes (codes with timestamps)
8080
- time-series numerical values (codes with timestamps and numerical values)
@@ -95,9 +95,9 @@ By following these steps, you can seamlessly transform your dataset, define nece
9595
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"
9696
```
9797

98-
- For the exhuastive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24)
98+
- For the exhaustive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24)
9999

100-
3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`. This stage (and the previous stage) use sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).
100+
3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).
101101

102102
**Example: Aggregate time-series data** on features across different `window_sizes`
103103

@@ -125,34 +125,34 @@ By following these steps, you can seamlessly transform your dataset, define nece
125125
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
126126
```
127127

128-
5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-permutations` command (See the section below for descriptions).
128+
5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-subsets` command (See the section below for descriptions).
129129

130130
```console
131131
meds-tab-xgboost --multirun \
132132
MEDS_cohort_dir="path_to_data" \
133133
task_name=$TASK \
134134
output_dir="output_directory" \
135135
tabularization.min_code_inclusion_frequency=10 \
136-
tabularization.window_sizes=$(generate-permutations [1d,30d,365d,full]) \
136+
tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \
137137
do_overwrite=False \
138-
tabularization.aggs=$(generate-permutations [static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max])
138+
tabularization.aggs=$(generate-subsets [static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max])
139139
```
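
To get a feel for how large this sweep space is, the sketch below counts the candidate values that `generate-subsets` would hand to each option, using the fact that a list of `n` distinct items has `2**n - 1` non-empty subsets. The helper name `count_non_empty_subsets` is hypothetical and not part of the package. The product is only an upper bound on distinct (`window_sizes`, `aggs`) pairs; how many are actually evaluated depends on the sweeper in use, which this sketch does not model.

```python
def count_non_empty_subsets(n_items: int) -> int:
    """Number of non-empty subsets of a set with n_items distinct elements."""
    return 2**n_items - 1


# Values copied from the command above.
window_sizes = ["1d", "30d", "365d", "full"]
aggs = [
    "static/present",
    "static/first",
    "code/count",
    "value/count",
    "value/sum",
    "value/sum_sqd",
    "value/min",
    "value/max",
]

n_window_choices = count_non_empty_subsets(len(window_sizes))
n_agg_choices = count_non_empty_subsets(len(aggs))

# Upper bound on distinct (window_sizes, aggs) pairs in the search space.
n_pairs = n_window_choices * n_agg_choices

print(n_window_choices, n_agg_choices, n_pairs)  # 15 255 3825
```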
140140

141141
## Additional CLI Scripts
142142

143-
1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
143+
1. **`generate-subsets`**: Generates and prints a sorted list of all non-empty subsets from a comma-separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
144144

145-
For example you can directly call **`generate-permutations`** in the command line:
145+
For example, you can directly call **`generate-subsets`** in the command line:
146146

147147
```console
148-
generate-permutations [2,3,4] \
148+
generate-subsets [2,3,4] \
149149
[2], [2, 3], [2, 3, 4], [2, 4], [3], [3, 4], [4]
150150
```
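
The behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual implementation: the function name `non_empty_subsets` is hypothetical, and it assumes the input items are already in sorted order so that each subset comes out internally sorted.

```python
from itertools import combinations


def non_empty_subsets(items):
    """Return every non-empty subset of `items` as a sorted list of lists."""
    subsets = [
        list(combo)
        for size in range(1, len(items) + 1)  # subset sizes 1..n
        for combo in combinations(items, size)  # keeps the input order
    ]
    # Lists compare lexicographically, which gives the ordering shown above.
    return sorted(subsets)


print(non_empty_subsets([2, 3, 4]))
# [[2], [2, 3], [2, 3, 4], [2, 4], [3], [3, 4], [4]]
```

The number of subsets grows as `2**n - 1`, so even short input lists expand into many sweep candidates.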
151151

152152
This could be used in the command line in concert with other calls. For example, the following call:
153153

154154
```console
155-
meds-tab-xgboost --multirun tabularization.window_sizes=$(generate-permutations [1d,2d,7d,full])
155+
meds-tab-xgboost --multirun tabularization.window_sizes=$(generate-subsets [1d,2d,7d,full])
156156
```
157157

158158
would resolve to:
@@ -299,7 +299,7 @@ Now that we have generated tabular features for all the events in our dataset, w
299299
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
300300
- **Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.
301301

302-
The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `patient_id`, `timestamp`, and `label` columns.
302+
The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard file structure as the input meds data from step (1), and the label parquets need `patient_id`, `timestamp`, and `label` columns.
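
As a rough illustration of the row selection described here, the sketch below keeps only the feature rows whose (`patient_id`, `timestamp`) pair appears in a task's labels. It uses plain Python containers rather than `.npz` or parquet files, and the function name `select_task_rows` and the in-memory layout are assumptions made for the example, not the package's API. Each feature row is stored as a dict of non-zero columns only, mirroring the idea of sparse storage.

```python
REQUIRED_LABEL_COLUMNS = {"patient_id", "timestamp", "label"}


def select_task_rows(feature_rows, labels):
    """Keep only the feature rows that have a matching task label.

    feature_rows: dict mapping (patient_id, timestamp) to a sparse row,
        itself a dict of {column_index: non_zero_value}.
    labels: list of dicts with "patient_id", "timestamp", and "label" keys.
    Returns two lists, features and labels, aligned by position.
    """
    selected_features, selected_labels = [], []
    for row in labels:
        missing = REQUIRED_LABEL_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"label row is missing columns: {sorted(missing)}")
        key = (row["patient_id"], row["timestamp"])
        if key in feature_rows:  # rows without features are skipped
            selected_features.append(feature_rows[key])
            selected_labels.append(row["label"])
    return selected_features, selected_labels


# Hypothetical toy data: three tabularized rows, two labeled events.
features = {
    (1, "2020-01-01"): {0: 2.0, 5: 1.0},
    (1, "2020-02-01"): {3: 4.0},
    (2, "2020-01-15"): {0: 1.0},
}
task_labels = [
    {"patient_id": 1, "timestamp": "2020-02-01", "label": 1},
    {"patient_id": 2, "timestamp": "2020-01-15", "label": 0},
]

X, y = select_task_rows(features, task_labels)
print(X, y)  # [{3: 4.0}, {0: 1.0}] [1, 0]
```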
303303

304304
## 4. XGBoost Training
305305

@@ -309,7 +309,7 @@ The final stage uses the processed and cached data to train an XGBoost model. Th
309309

310310
- **Iterator for Data Loading**: Custom iterators are designed to load sparse matrices efficiently into the XGBoost training process, which can handle sparse inputs natively, thus maintaining high computational efficiency.
311311
- **Training and Validation**: The model is trained using the tabular data, with evaluation steps that include early stopping to prevent overfitting and tuning of hyperparameters based on validation performance.
312-
- **Hyperaparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion frequency.
312+
- **Hyperparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion frequency.
313313

314314
______________________________________________________________________
315315

@@ -332,15 +332,15 @@ The benchmarking tests were conducted using the following hardware and software
332332

333333
### MEDS-Tab Tabularization Technique
334334

335-
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
335+
Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory-efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets, we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patients' data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
336336

337337
## 2. Comparative Performance Analysis
338338

339-
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
339+
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it was never completed within the 10-minute budget.
340340

341341
### eICU Dataset
342342

343-
The only method that was able to tabularize eICU data was MEDS-Tab. We ran our method with both 100 and 500 patients, resulting in an increment by three times in the number of codes. MEDS-Tab gave efficient results in terms of both time and memory usage.
343+
The only method that was able to tabularize eICU data was MEDS-Tab. We ran our method with both 100 and 500 patients, resulting in an increment of three times in the number of codes. MEDS-Tab gave efficient results in terms of both time and memory usage.
344344

345345
a) 100 Patients
346346

@@ -420,7 +420,7 @@ meds-tab-xgboost
420420
do_overwrite=False \
421421
```
422422

423-
This uses the defaults minimum code inclusion frequency, window sizes, and aggregations from the `launch_xgboost.yaml`:
423+
This uses the default minimum code inclusion frequency, window sizes, and aggregations from the `launch_xgboost.yaml`:
424424

425425
```yaml
426426
allowed_codes: # allows all codes that meet min code inclusion frequency
@@ -487,9 +487,9 @@ meds-tab-xgboost --multirun \
487487
MEDS_cohort_dir="path_to_data" \
488488
task_name=$TASK \
489489
output_dir="output_directory" \
490-
tabularization.window_sizes=$(generate-permutations [1d,30d,365d,full]) \
490+
tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \
491491
do_overwrite=False \
492-
tabularization.aggs=$(generate-permutations [static/present,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max])
492+
tabularization.aggs=$(generate-subsets [static/present,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max])
493493
```
494494

495495
The model parameters were set to:
@@ -542,7 +542,7 @@ For a complete example on MIMIC-IV and for all of our config files, see the [MIM
542542

543543
#### 2.2 XGBoost Optimal Found Model Parameters
544544

545-
Additionally, the model parameters from the highest performing run are reported below.
545+
Additionally, the model parameters from the highest-performing run are reported below.
546546

547547
| Task | Index Timestamp | Eta | Lambda | Alpha | Subsample | Minimum Child Weight | Number of Boosting Rounds | Early Stopping Rounds | Max Tree Depth |
548548
| ------------------------------- | ----------------- | ----- | ------ | ----- | --------- | -------------------- | ------------------------- | --------------------- | -------------- |
@@ -564,7 +564,7 @@ Additionally, the model parameters from the highest performing run are reported
564564

565565
The eICU sweep was conducted equivalently to the MIMIC-IV sweep. Please refer to the MIMIC-IV Sweep subsection above for details on the commands and sweep parameters.
566566

567-
For more details about eICU specific task generation and running, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
567+
For more details about eICU-specific task generation and running, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
568568

569569
#### 1. XGBoost Performance on eICU
570570
