You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backwards in time from a given index date.
25
+
1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backward in time from a given index date.
26
26
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.
27
27
28
28
## Quick Start
@@ -45,8 +45,8 @@ pip install .
45
45
46
46
## Scripts and Examples
47
47
48
-
For an end to end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
49
-
For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
48
+
For an end-to-end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
49
+
For an end-to-end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
50
50
51
51
See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
52
52
@@ -74,7 +74,7 @@ By following these steps, you can seamlessly transform your dataset, define nece
74
74
75
75
## Core CLI Scripts Overview
76
76
77
-
1.**`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
77
+
1.**`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different codetypes. It differentiates codes into the following categories:
78
78
79
79
- time-series codes (codes with timestamps)
80
80
- time-series numerical values (codes with timestamps and numerical values)
@@ -95,9 +95,9 @@ By following these steps, you can seamlessly transform your dataset, define nece
- For the exhuastive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24)
98
+
- For the exhaustive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24)
99
99
100
-
3.**`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`. This stage (and the previous stage) use sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).
100
+
3.**`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).
101
101
102
102
**Example: Aggregate time-series data** on features across different `window_sizes`
103
103
@@ -125,34 +125,34 @@ By following these steps, you can seamlessly transform your dataset, define nece
5.**`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-permutations` command (See the section below for descriptions).
128
+
5.**`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-subsets` command (See the section below for descriptions).
1.**`generate-permutations`**: Generates and prints a sorted list of all permutations from a commaseparated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
143
+
1.**`generate-subsets`**: Generates and prints a sorted list of all non-empty subsets from a comma-separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
144
144
145
-
For example you can directly call **`generate-permutations`** in the command line:
145
+
For example, you can directly call **`generate-subsets`** in the command line:
146
146
147
147
```console
148
-
generate-permutations [2,3,4] \
148
+
generate-subsets [2,3,4] \
149
149
[2], [2, 3], [2, 3, 4], [2, 4], [3], [3, 4], [4]
150
150
```
151
151
152
152
This could be used in the command line in concert with other calls. For example, the following call:
@@ -299,7 +299,7 @@ Now that we have generated tabular features for all the events in our dataset, w
299
299
-**Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
300
300
-**Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.
301
301
302
-
The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `patient_id`, `timestamp`, and `label` columns.
302
+
The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard file structure as the input meds data from step (1), and the label parquets need `patient_id`, `timestamp`, and `label` columns.
303
303
304
304
## 4. XGBoost Training
305
305
@@ -309,7 +309,7 @@ The final stage uses the processed and cached data to train an XGBoost model. Th
309
309
310
310
-**Iterator for Data Loading**: Custom iterators are designed to load sparse matrices efficiently into the XGBoost training process, which can handle sparse inputs natively, thus maintaining high computational efficiency.
311
311
-**Training and Validation**: The model is trained using the tabular data, with evaluation steps that include early stopping to prevent overfitting and tuning of hyperparameters based on validation performance.
312
-
-**Hyperaparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion frequency.
312
+
-**Hyperparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion frequency.
@@ -332,15 +332,15 @@ The benchmarking tests were conducted using the following hardware and software
332
332
333
333
### MEDS-Tab Tabularization Technique
334
334
335
-
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memoryefficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memoryefficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
335
+
Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory-efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets, we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patients' data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
336
336
337
337
## 2. Comparative Performance Analysis
338
338
339
-
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10minute budget.
339
+
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it was never completed within the 10-minute budget.
340
340
341
341
### eICU Dataset
342
342
343
-
The only method that was able to tabularize eICU data was MEDS-Tab. We ran our method with both 100 and 500 patients, resulting in an increment by three times in the number of codes. MEDS-Tab gave efficient results in terms of both time and memory usage.
343
+
The only method that was able to tabularize eICU data was MEDS-Tab. We ran our method with both 100 and 500 patients, resulting in an increment of three times in the number of codes. MEDS-Tab gave efficient results in terms of both time and memory usage.
344
344
345
345
a) 100 Patients
346
346
@@ -420,7 +420,7 @@ meds-tab-xgboost
420
420
do_overwrite=False \
421
421
```
422
422
423
-
This uses the defaults minimum code inclusion frequency, window sizes, and aggregations from the `launch_xgboost.yaml`:
423
+
This uses the default minimum code inclusion frequency, window sizes, and aggregations from the `launch_xgboost.yaml`:
424
424
425
425
```yaml
426
426
allowed_codes: # allows all codes that meet min code inclusion frequency
@@ -542,7 +542,7 @@ For a complete example on MIMIC-IV and for all of our config files, see the [MIM
542
542
543
543
#### 2.2 XGBoost Optimal Found Model Parameters
544
544
545
-
Additionally, the model parameters from the highestperforming run are reported below.
545
+
Additionally, the model parameters from the highest-performing run are reported below.
546
546
547
547
| Task | Index Timestamp | Eta | Lambda | Alpha | Subsample | Minimum Child Weight | Number of Boosting Rounds | Early Stopping Rounds | Max Tree Depth |
@@ -564,7 +564,7 @@ Additionally, the model parameters from the highest performing run are reported
564
564
565
565
The eICU sweep was conducted equivalently to the MIMIC-IV sweep. Please refer to the MIMIC-IV Sweep subsection above for details on the commands and sweep parameters.
566
566
567
-
For more details about eICUspecific task generation and running, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
567
+
For more details about eICU-specific task generation and running, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
0 commit comments