From 29c073fffb3127af8c2b4996494b276e6e1a474c Mon Sep 17 00:00:00 2001
From: Nassim Oufattole
Date: Thu, 24 Oct 2024 17:02:06 -0400
Subject: [PATCH] Include documentation md files in mdformat (except index.md
 and tutorial.md, which import existing README files that are already
 formatted; mdformat is not compatible with the import syntax)

---
 .pre-commit-config.yaml | 3 +-
 README.md | 6 +-
 docs/prediction.md | 42 +++---
 docs/terminology.md | 85 ++++++------
 docs/usage_guide.md | 292 ++++++++++++++++++++++++----------------
 5 files changed, 246 insertions(+), 182 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 8426225..b3b8375 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -88,8 +88,8 @@ repos:
     rev: 0.7.17
     hooks:
       - id: mdformat
-        exclude: "^docs/.*\\.md$"
         args: ["--number"]
+        exclude: "docs/tutorial.md|docs/index.md"
         additional_dependencies:
           - mdformat-gfm
           - mdformat-tables
@@ -98,6 +98,7 @@ repos:
           - mdformat-black
           - mdformat-config
           - mdformat-shfmt
+          - mdformat-mkdocs

 # word spelling linter
 - repo: https://github.com/codespell-project/codespell

diff --git a/README.md b/README.md
index 133e290..f525087 100644
--- a/README.md
+++ b/README.md
@@ -139,10 +139,10 @@ MEDS-Tab has several key limitations which we plan to address in future changes.

 ### Technical debt / code improvements

 1. The computation and use of the code metadata dataframe, containing frequencies of codes, should be offloaded to core MEDS functionality, with the remaining code in this repository cleaned up.
-   - [#28](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/28)
+   - [#28](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/28)
 2. 
We should add more doctests and push test coverage up to 100% - - [#29](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/29) - - [#30](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/30) + - [#29](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/29) + - [#30](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/30) ## What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular? diff --git a/docs/prediction.md b/docs/prediction.md index 0e634ff..c7732d7 100644 --- a/docs/prediction.md +++ b/docs/prediction.md @@ -116,18 +116,18 @@ For a complete example on MIMIC-IV and for all of our config files, see the [MIM #### 2.1 XGBoost Performance on MIMIC-IV | Task | Index Timestamp | AUC | Minimum Code Inclusion Count | Number of Included Codes\* | Window Sizes | Aggregations | -| ------------------------------- | ----------------- | ----- | -------------------------------- | -------------------------- | ---------------------- | --------------------------------------------------------------------------- | -| Post-discharge 30 day Mortality | Discharge | 0.935 | 1,371 | 5,712 | \[7d,full\] | \[code/count,value/count,value/min,value/max\] | -| Post-discharge 1 year Mortality | Discharge | 0.898 | 289 | 10,048 | \[2h,12h,1d,30d,full\] | \[static/present,code/count,value/sum_sqd,value/min\] | -| 30 day Readmission | Discharge | 0.708 | 303 | 9,903 | \[30d,365d,full\] | \[code/count,value/count,value/sum,value/sum_sqd,value/max\] | -| In ICU Mortality | Admission + 24 hr | 0.661 | 7,059 | 3,037 | \[12h,full\] | \[static/present,code/count,value/sum,value/min,value/max\] | -| In ICU Mortality | Admission + 48 hr | 0.673 | 71 | 16,112 | \[1d,7d,full\] | \[static/present,code/count,value/sum,value/min,value/max\] | -| In Hospital Mortality | Admission + 24 hr | 0.812 | 43 | 18,989 | \[1d,full\] | \[static/present,code/count,value/sum,value/min,value/max\] | -| In Hospital Mortality | Admission + 48 hr | 0.810 | 678 | 7,433 | 
\[1d,full\] | \[static/present,code/count,value/count\] | -| LOS in ICU > 3 days | Admission + 24 hr | 0.946 | 30,443 | 1,624 | \[2h,7d,30d\] | \[static/present,code/count,value/count,value/sum,value/sum_sqd,value/max\] | -| LOS in ICU > 3 days | Admission + 48 hr | 0.967 | 2,864 | 4,332 | \[2h,7d,30d\] | \[code/count,value/sum_sqd,value/max\] | -| LOS in Hospital > 3 days | Admission + 24 hr | 0.943 | 94,633 | 912 | \[12h,1d,7d\] | \[code/count,value/count,value/sum_sqd\] | -| LOS in Hospital > 3 days | Admission + 48 hr | 0.945 | 30,880 | 1,619 | \[1d,7d,30d\] | \[code/count,value/sum,value/min,value/max\] | +| ------------------------------- | ----------------- | ----- | ---------------------------- | -------------------------- | ---------------------- | --------------------------------------------------------------------------- | +| Post-discharge 30 day Mortality | Discharge | 0.935 | 1,371 | 5,712 | \[7d,full\] | \[code/count,value/count,value/min,value/max\] | +| Post-discharge 1 year Mortality | Discharge | 0.898 | 289 | 10,048 | \[2h,12h,1d,30d,full\] | \[static/present,code/count,value/sum_sqd,value/min\] | +| 30 day Readmission | Discharge | 0.708 | 303 | 9,903 | \[30d,365d,full\] | \[code/count,value/count,value/sum,value/sum_sqd,value/max\] | +| In ICU Mortality | Admission + 24 hr | 0.661 | 7,059 | 3,037 | \[12h,full\] | \[static/present,code/count,value/sum,value/min,value/max\] | +| In ICU Mortality | Admission + 48 hr | 0.673 | 71 | 16,112 | \[1d,7d,full\] | \[static/present,code/count,value/sum,value/min,value/max\] | +| In Hospital Mortality | Admission + 24 hr | 0.812 | 43 | 18,989 | \[1d,full\] | \[static/present,code/count,value/sum,value/min,value/max\] | +| In Hospital Mortality | Admission + 48 hr | 0.810 | 678 | 7,433 | \[1d,full\] | \[static/present,code/count,value/count\] | +| LOS in ICU > 3 days | Admission + 24 hr | 0.946 | 30,443 | 1,624 | \[2h,7d,30d\] | \[static/present,code/count,value/count,value/sum,value/sum_sqd,value/max\] | 
+| LOS in ICU > 3 days | Admission + 48 hr | 0.967 | 2,864 | 4,332 | \[2h,7d,30d\] | \[code/count,value/sum_sqd,value/max\] | +| LOS in Hospital > 3 days | Admission + 24 hr | 0.943 | 94,633 | 912 | \[12h,1d,7d\] | \[code/count,value/count,value/sum_sqd\] | +| LOS in Hospital > 3 days | Admission + 48 hr | 0.945 | 30,880 | 1,619 | \[1d,7d,30d\] | \[code/count,value/sum,value/min,value/max\] | - Number of Included Codes is based on Minimum Code Inclusion Count -- we calculated the number of resulting codes that were above the minimum threshold and reported that. @@ -160,15 +160,15 @@ For more details about eICU specific task generation and running, see the [eICU #### 1. XGBoost Performance on eICU | Task | Index Timestamp | AUC | Minimum Code Inclusion Count | Window Sizes | Aggregations | -| ------------------------------- | ----------------- | ----- | -------------------------------- | ------------------------ | -------------------------------------------------------------- | -| Post-discharge 30 day Mortality | Discharge | 0.603 | 68,235 | \[12h,1d,full\] | \[code/count,value/sum_sqd,value/max\] | -| Post-discharge 1 year Mortality | Discharge | 0.875 | 3,280 | \[30d,365d\] | \[static/present,value/sum,value/sum_sqd,value/min,value/max\] | -| In Hospital Mortality | Admission + 24 hr | 0.855 | 335,912 | \[2h,7d,30d,365d,full\] | \[static/present,code/count,value/count,value/min,value/max\] | -| In Hospital Mortality | Admission + 48 hr | 0.570 | 89,121 | \[12h,1d,30d\] | \[code/count,value/count,value/min\] | -| LOS in ICU > 3 days | Admission + 24 hr | 0.783 | 7,881 | \[1d,30d,full\] | \[static/present,code/count,value/count,value/sum,value/max\] | -| LOS in ICU > 3 days | Admission + 48 hr | 0.757 | 1,719 | \[2h,12h,7d,30d,full\] | \[code/count,value/count,value/sum,value/sum_sqd,value/min\] | -| LOS in Hospital > 3 days | Admission + 24 hr | 0.864 | 160 | \[1d,30d,365d,full\] | \[static/present,code/count,value/min,value/max\] | -| LOS in Hospital > 3 days | 
Admission + 48 hr | 0.895 | 975 | \[12h,1d,30d,365d,full\] | \[code/count,value/count,value/sum,value/sum_sqd\] | +| ------------------------------- | ----------------- | ----- | ---------------------------- | ------------------------ | -------------------------------------------------------------- | +| Post-discharge 30 day Mortality | Discharge | 0.603 | 68,235 | \[12h,1d,full\] | \[code/count,value/sum_sqd,value/max\] | +| Post-discharge 1 year Mortality | Discharge | 0.875 | 3,280 | \[30d,365d\] | \[static/present,value/sum,value/sum_sqd,value/min,value/max\] | +| In Hospital Mortality | Admission + 24 hr | 0.855 | 335,912 | \[2h,7d,30d,365d,full\] | \[static/present,code/count,value/count,value/min,value/max\] | +| In Hospital Mortality | Admission + 48 hr | 0.570 | 89,121 | \[12h,1d,30d\] | \[code/count,value/count,value/min\] | +| LOS in ICU > 3 days | Admission + 24 hr | 0.783 | 7,881 | \[1d,30d,full\] | \[static/present,code/count,value/count,value/sum,value/max\] | +| LOS in ICU > 3 days | Admission + 48 hr | 0.757 | 1,719 | \[2h,12h,7d,30d,full\] | \[code/count,value/count,value/sum,value/sum_sqd,value/min\] | +| LOS in Hospital > 3 days | Admission + 24 hr | 0.864 | 160 | \[1d,30d,365d,full\] | \[static/present,code/count,value/min,value/max\] | +| LOS in Hospital > 3 days | Admission + 48 hr | 0.895 | 975 | \[12h,1d,30d,365d,full\] | \[code/count,value/count,value/sum,value/sum_sqd\] | #### 2. XGBoost Optimal Found Model Parameters diff --git a/docs/terminology.md b/docs/terminology.md index 59049b0..edeb3f8 100644 --- a/docs/terminology.md +++ b/docs/terminology.md @@ -4,73 +4,74 @@ This document defines key terms used in MEDS-Tab. 
For complete reference, see th ## Core MEDS Fields -| Field | Definition | -|-------|------------| -| `subject_id` | Unique identifier for each patient | -| `time` | Timestamp when the data was recorded (NULL for static data) | -| `code` | Feature identifier/name | -| `numeric_value` | Measurement value (when applicable) | +| Field | Definition | +| --------------- | ----------------------------------------------------------- | +| `subject_id` | Unique identifier for each patient | +| `time` | Timestamp when the data was recorded (NULL for static data) | +| `code` | Feature identifier/name | +| `numeric_value` | Measurement value (when applicable) | One example of this is referred to as a `measurement` in MEDS-Tab, which is a single row of data with the fields above. For example: + ```yaml -subject_id: "patient_123" -time: "2024-01-15 14:30:00" -code: "HEART_RATE" +subject_id: patient_123 +time: '2024-01-15 14:30:00' +code: HEART_RATE numeric_value: 72.0 ``` -represents a heart rate measurement of 72.0 for patient_123 taken on January 15th, 2024 at 2:30 PM. +represents a heart rate measurement of 72.0 for patient_123 taken on January 15th, 2024 at 2:30 PM. 
## Feature Types Measurements in MEDS-Tab are categorized into four types based on whether they include timestamps and numeric values: -| Term | Definition | Examples | -|------|------------|-----------| -| Static Codes | Measurements with no timestamp and no numeric value | Gender, blood type | -| Static Numeric Values | Measurements with no timestamp but including a numeric value | Birth weight, admission height | -| Dynamic Codes | Measurements with a timestamp but no numeric value | Diagnosis codes, medication orders | -| Dynamic Numeric Values | Measurements with both a timestamp and numeric value | Vital signs, lab results | +| Term | Definition | Examples | +| ---------------------- | ------------------------------------------------------------ | ---------------------------------- | +| Static Codes | Measurements with no timestamp and no numeric value | Gender, blood type | +| Static Numeric Values | Measurements with no timestamp but including a numeric value | Birth weight, admission height | +| Dynamic Codes | Measurements with a timestamp but no numeric value | Diagnosis codes, medication orders | +| Dynamic Numeric Values | Measurements with both a timestamp and numeric value | Vital signs, lab results | Note that "Static" and "Dynamic" refer to whether a timestamp is recorded in the MEDS data, not whether the underlying concept can change over time. 
## Aggregation Functions -| Aggregation | Applies To | Definition | -|-------------|------------|------------| -| `static/present` | Static Codes | Binary indicator of code presence | -| `static/first` | Static Numeric Values | The numeric value | -| `code/count` | Dynamic Codes | Count of code occurrences within lookback window | -| `value/count` | Dynamic Numeric Values | Count of measurements within lookback window | -| `value/sum` | Dynamic Numeric Values | Sum of measurements within lookback window | -| `value/sum_sqd` | Dynamic Numeric Values | Sum of squared measurements within lookback window | -| `value/min` | Dynamic Numeric Values | Minimum measurement within lookback window | -| `value/max` | Dynamic Numeric Values | Maximum measurement within lookback window | +| Aggregation | Applies To | Definition | +| ---------------- | ---------------------- | -------------------------------------------------- | +| `static/present` | Static Codes | Binary indicator of code presence | +| `static/first` | Static Numeric Values | The numeric value | +| `code/count` | Dynamic Codes | Count of code occurrences within lookback window | +| `value/count` | Dynamic Numeric Values | Count of measurements within lookback window | +| `value/sum` | Dynamic Numeric Values | Sum of measurements within lookback window | +| `value/sum_sqd` | Dynamic Numeric Values | Sum of squared measurements within lookback window | +| `value/min` | Dynamic Numeric Values | Minimum measurement within lookback window | +| `value/max` | Dynamic Numeric Values | Maximum measurement within lookback window | Static aggregations are computed once per subject_id and static code. Dynamic aggregations are computed per subject_id, code, and lookback window, where the lookback window defines the time period before a reference time point over which measurements are aggregated. 
Note that the value-based aggregations (`value/*`) are only computed for the subset of dynamic code measurements that include numeric values, while `code/count` is computed for all dynamic codes regardless of whether they have numeric values. We provide examples of these aggregations here. Notice that for dynamic aggregations, data within a lookback window (e.g., last 24 hours) is input to the aggregation function. -| Aggregation | Input Data | Result | Explanation | -|-------------|------------|--------|-------------| -| `static/present` | Gender//Female | 1 | Indicates the presence (1) of the code "Gender//Female" | -| `static/first` | Birth Weight: 3.2 kg | 3.2 | Returns the numeric value of the static measurement | -| `code/count` | Heart Rate: [80, NULL, 78, 90] | 4 | Counts the occurrences of codes within the lookback window | -| `value/count` | Heart Rate: [80, 78, 90] | 3 | Counts the number of measurements recorded within the lookback window | -| `value/sum` | Glucose Levels: [100, 110, 105] | 315 | Sums the measurement values within the lookback window | -| `value/sum_sqd` | Blood Pressure Readings: [120, 125] | 30,025 | Sums the squares of the measurements (120² + 125²) | -| `value/min` | Temperature Readings: [37.5, 38.0, 37.0] | 37.0 | Finds the minimum value within the lookback window | -| `value/max` | Respiratory Rate: [16, 18, 20] | 20 | Finds the maximum value within the lookback window | - +| Aggregation | Input Data | Result | Explanation | +| ---------------- | ------------------------------------------ | ------ | --------------------------------------------------------------------- | +| `static/present` | Gender//Female | 1 | Indicates the presence (1) of the code "Gender//Female" | +| `static/first` | Birth Weight: 3.2 kg | 3.2 | Returns the numeric value of the static measurement | +| `code/count` | Heart Rate: \[80, NULL, 78, 90\] | 4 | Counts the occurrences of codes within the lookback window | +| `value/count` | Heart Rate: \[80, 78, 
90\] | 3 | Counts the number of measurements recorded within the lookback window | +| `value/sum` | Glucose Levels: \[100, 110, 105\] | 315 | Sums the measurement values within the lookback window | +| `value/sum_sqd` | Blood Pressure Readings: \[120, 125\] | 30,025 | Sums the squares of the measurements (120² + 125²) | +| `value/min` | Temperature Readings: \[37.5, 38.0, 37.0\] | 37.0 | Finds the minimum value within the lookback window | +| `value/max` | Respiratory Rate: \[16, 18, 20\] | 20 | Finds the maximum value within the lookback window | ## Lookback Window We define a lookback window as a time period before a reference time point over which dynamic data is aggregated. By default, we use the lookback windows (defined in [this default hydra config](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/configs/tabularization/default.yaml)): + ```yaml window_sizes: - - "1d" # 1 day - - "7d" # 7 days - - "30d" # 30 days - - "365d" # 1 year - - "full" # full subject history + - 1d # 1 day + - 7d # 7 days + - 30d # 30 days + - 365d # 1 year + - full # full subject history ``` diff --git a/docs/usage_guide.md b/docs/usage_guide.md index b36aa32..0a9ca3e 100644 --- a/docs/usage_guide.md +++ b/docs/usage_guide.md @@ -19,16 +19,17 @@ MEDS_transform-reshard_to_split \ ``` ??? note "Args Description" - * `--multirun`: This is an optional argument to specify that the command should be run in parallel. We use this here to parallelize the resharing of the data. - * `hydra/launcher`: This is an optional argument to specify the launcher. When using multirun you should specify the launcher. We use joblib here which enables parallelization on a single machine. - * `worker`: When using joblib or a hydra slurm launcher, the range of workers must be defined as it specifies the number of parallel workers to spawn. We use 6 workers here. - * `input_dir`: The directory containing the MEDS data. 
- * `cohort_dir`: The directory to store the resharded data.
- * `stages`: The stages to run. We only run the reshard_to_split stage here. MEDS Transform allows for a sequence of stages to be defined and run which is why this is a list.
- * `stage`: The specific stage to run. We run the reshard_to_split stage here. It must be one of the stages in the `stages` kwarg list.
- * `stage_configs.reshard_to_split.n_subjects_per_shard`: The number of subjects per shard. We use 2500 subjects per shard here.
+ - `--multirun`: This is an optional argument to specify that the command should be run in parallel. We use this here to parallelize the resharding of the data.
+ - `hydra/launcher`: This is an optional argument to specify the launcher. When using multirun, you should specify the launcher. We use joblib here, which enables parallelization on a single machine.
+ - `worker`: When using joblib or a hydra slurm launcher, the range of workers must be defined, as it specifies the number of parallel workers to spawn. We use 6 workers here.
+ - `input_dir`: The directory containing the MEDS data.
+ - `cohort_dir`: The directory to store the resharded data.
+ - `stages`: The stages to run. We only run the reshard_to_split stage here. MEDS Transform allows for a sequence of stages to be defined and run, which is why this is a list.
+ - `stage`: The specific stage to run. We run the reshard_to_split stage here. It must be one of the stages in the `stages` kwarg list.
+ - `stage_configs.reshard_to_split.n_subjects_per_shard`: The number of subjects per shard. We use 2500 subjects per shard here.

 ### Input Data Structure
+
 ```text
 MEDS_DIR/
 │
@@ -44,6 +45,7 @@ MEDS_DIR/
 ```

 ### Output Data Structure (New Files)
+
 ```text
 MEDS_RESHARD_DIR/
 │
@@ -59,21 +61,25 @@ MEDS_RESHARD_DIR/
 ```

 ### Complete Directory Structure
+
 !!! abstract "Stage 0 Directory Structure"
     ??? folder "MEDS_DIR"
         ??? folder "SPLIT A"
-            * 📄 SHARD 0.parquet
+            - 📄 SHARD 0.parquet
+
         ??? 
folder "SPLIT B" - * 📄 SHARD 0.parquet + - 📄 SHARD 0.parquet + ??? folder "MEDS_RESHARD_DIR" ??? folder "SPLIT A" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet - * 📄 ... + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + - 📄 ... + ??? folder "SPLIT B" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet - * 📄 ... + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + - 📄 ... For the rest of the tutorial we will assume that the data has been reshared into the `MEDS_RESHARD_DIR` directory, but this step is optional, and you could instead use the original data directory, `MEDS_DIR`. If you experience high memory issues in later stages, you should try reducing `stage_configs.reshard_to_split.n_subjects_per_shard` to a smaller number. @@ -81,10 +87,10 @@ For the rest of the tutorial we will assume that the data has been reshared into This command processes MEDS data shards to compute the frequencies of different code types. It differentiates codes into the following categories: -* dynamic codes (codes with timestamps) -* dynamic numeric values (codes with timestamps and numerical values) -* static codes (codes without timestamps) -* static numeric values (codes without timestamps but with numerical values) +- dynamic codes (codes with timestamps) +- dynamic numeric values (codes with timestamps and numerical values) +- static codes (codes without timestamps) +- static numeric values (codes without timestamps but with numerical values) This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `OUTPUT_DIR` argument specified as a hydra-style command line argument. @@ -96,10 +102,11 @@ meds-tab-describe \ This stage is not parallelized as it runs very quickly. ??? note "Args Description" - * `input_dir`: The directory containing the MEDS data. - * `output_dir`: The directory to store the tabularized data. + - `input_dir`: The directory containing the MEDS data. + - `output_dir`: The directory to store the tabularized data. 
### Input Data Structure + ```text MEDS_RESHARD_DIR/ │ @@ -115,6 +122,7 @@ MEDS_RESHARD_DIR/ ``` ### Output Data Structure (New Files) + ```text OUTPUT_DIR/ │ @@ -123,24 +131,29 @@ OUTPUT_DIR/ ``` ### Complete Directory Structure + !!! abstract "Stage 1 Directory Structure" ??? folder "MEDS_DIR" ??? folder "SPLIT A" - * 📄 SHARD 0.parquet + - 📄 SHARD 0.parquet + ??? folder "SPLIT B" - * 📄 SHARD 0.parquet + - 📄 SHARD 0.parquet + ??? folder "MEDS_RESHARD_DIR" ??? folder "SPLIT A" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet - * 📄 ... + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + - 📄 ... + ??? folder "SPLIT B" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet - * 📄 ... + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + - 📄 ... + ??? folder "OUTPUT_DIR" ??? folder "metadata" - * 📄 codes.parquet + - 📄 codes.parquet ## 3. **`meds-tab-tabularize-static`** @@ -159,27 +172,29 @@ meds-tab-tabularize-static \ This stage is not parallelized as it runs very quickly. ??? note "Args Description" - * `input_dir`: The directory containing the MEDS data. - * `output_dir`: The directory to store the tabularized data. - * `tabularization.min_code_inclusion_count`: The minimum number of times a code must appear. - * `tabularization.window_sizes`: The window sizes to use for aggregations. - * `do_overwrite`: Whether to overwrite existing files. - * `tabularization.aggs`: The aggregation methods to use. + - `input_dir`: The directory containing the MEDS data. + - `output_dir`: The directory to store the tabularized data. + - `tabularization.min_code_inclusion_count`: The minimum number of times a code must appear. + - `tabularization.window_sizes`: The window sizes to use for aggregations. + - `do_overwrite`: Whether to overwrite existing files. + - `tabularization.aggs`: The aggregation methods to use. !!! 
note "Code Inclusion Parameters" In addition to `min_code_inclusion_count` there are several other parameters that can be set in tabularization to restrict the codes that are included: - * `allowed_codes`: a list of codes to include in the tabularized data - * `min_code_inclusion_count`: The minimum number of times a code must appear - * `min_code_inclusion_frequency`: The minimum normalized frequency required - * `max_included_codes`: The maximum number of codes to include + - `allowed_codes`: a list of codes to include in the tabularized data + - `min_code_inclusion_count`: The minimum number of times a code must appear + - `min_code_inclusion_frequency`: The minimum normalized frequency required + - `max_included_codes`: The maximum number of codes to include ### Input Data Structure + ```text [Previous structure remains the same] ``` ### Output Data Structure (New Files) + ```text OUTPUT_DIR/ └─── tabularize/ @@ -197,36 +212,44 @@ OUTPUT_DIR/ ``` ### Complete Directory Structure After Static Tabularization + !!! abstract "Stage 3 Directory Structure" ??? folder "MEDS_DIR" ??? folder "SPLIT A" - * 📄 SHARD 0.parquet + - 📄 SHARD 0.parquet + ??? folder "SPLIT B" - * 📄 SHARD 0.parquet + - 📄 SHARD 0.parquet + ??? folder "MEDS_RESHARD_DIR" ??? folder "SPLIT A" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet - * 📄 ... + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + - 📄 ... + ??? folder "SPLIT B" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet - * 📄 ... + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + - 📄 ... + ??? folder "OUTPUT_DIR" ??? folder "metadata" - * 📄 codes.parquet + - 📄 codes.parquet + ??? folder "tabularize" ??? folder "SPLIT A" ??? folder "SHARD 0" ??? folder "none/static" - * 📄 present.npz - * 📄 first.npz + - 📄 present.npz + - 📄 first.npz + ??? folder "SHARD 1" ??? folder "none/static" - * 📄 present.npz - * 📄 first.npz + - 📄 present.npz + - 📄 first.npz + ??? folder "SPLIT B" - [Similar structure to SPLIT A] + \[Similar structure to SPLIT A\] ## 4. 
**`meds-tab-tabularize-time-series`** @@ -251,11 +274,13 @@ meds-tab-tabularize-time-series \ You must use the same code inclusion parameters (which in this example is just `tabularization.min_code_inclusion_count`) as in the previous stage, `meds-tab-tabularize-static`, to ensure that the same codes are included in the tabularized data. ### Input Data Structure + ```text [Previous structure remains the same] ``` ### Output Data Structure (New Files) + ```text OUTPUT_DIR/tabularize/ │ @@ -275,34 +300,44 @@ OUTPUT_DIR/tabularize/ ``` ### Complete Directory Structure + !!! abstract "Stage 4 Directory Structure" ??? folder "MEDS_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "MEDS_RESHARD_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "OUTPUT_DIR" ??? folder "metadata" - * 📄 codes.parquet + - 📄 codes.parquet + ??? folder "tabularize" ??? folder "SPLIT A" ??? folder "SHARD 0" ??? folder "none/static" - * 📄 present.npz - * 📄 first.npz + - 📄 present.npz + - 📄 first.npz + ??? folder "1d" ??? folder "code" - * 📄 count.npz + - 📄 count.npz + ??? folder "value" - * 📄 sum.npz + - 📄 sum.npz + ??? folder "7d" ??? folder "code" - * 📄 count.npz + - 📄 count.npz + ??? folder "value" - * 📄 sum.npz + - 📄 sum.npz + ??? folder "SHARD 1" - [Similar structure to SHARD 0] + \[Similar structure to SHARD 0\] + ??? folder "SPLIT B" - [Similar structure to SPLIT A] + \[Similar structure to SPLIT A\] ## 5. **`meds-tab-cache-task`** @@ -329,6 +364,7 @@ meds-tab-cache-task \ You must use the same code inclusion parameters (which in this example is just `tabularization.min_code_inclusion_count`) as in the previous stages to ensure that the same codes are included in the tabularized data. ### Input Data Structure + ```text # Previous structure plus: TASKS_DIR/ @@ -337,6 +373,7 @@ TASKS_DIR/ ``` ### Output Data Structure (New Files) + ```text OUTPUT_DIR/ └─── TASK/ @@ -352,50 +389,64 @@ OUTPUT_DIR/ ``` ### Complete Directory Structure + !!! 
abstract "Stage 5 Directory Structure" ??? folder "MEDS_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "MEDS_RESHARD_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "OUTPUT_DIR" ??? folder "metadata" - * 📄 codes.parquet + - 📄 codes.parquet + ??? folder "tabularize" - [Previous structure] + \[Previous structure\] + ??? folder "${TASK}" ??? folder "labels" ??? folder "SPLIT A" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + ??? folder "SPLIT B" - * 📄 SHARD 0.parquet - * 📄 SHARD 1.parquet + - 📄 SHARD 0.parquet + - 📄 SHARD 1.parquet + ??? folder "task_cache" ??? folder "SPLIT A" ??? folder "SHARD 0" ??? folder "none/static" - * 📄 present.npz - * 📄 first.npz + - 📄 present.npz + - 📄 first.npz + ??? folder "1d" ??? folder "code" - * 📄 count.npz + - 📄 count.npz + ??? folder "value" - * 📄 sum.npz + - 📄 sum.npz + ??? folder "7d" ??? folder "code" - * 📄 count.npz + - 📄 count.npz + ??? folder "value" - * 📄 sum.npz + - 📄 sum.npz + ??? folder "SHARD 1" - [Similar structure to SHARD 0] + \[Similar structure to SHARD 0\] + ??? folder "SPLIT B" - [Similar structure to SPLIT A] + \[Similar structure to SPLIT A\] ## 6. **`meds-tab-model`** Trains a tabular model using user-specified parameters. The system incorporates extended memory support through sequential shard loading during training and efficient data loading through custom iterators. ### Single Model Training + ```bash meds-tab-model \ model_launcher=xgboost \ @@ -409,6 +460,7 @@ meds-tab-model \ ``` ### Hyperparameter Optimization + ```bash meds-tab-model \ --multirun \ @@ -425,39 +477,40 @@ meds-tab-model \ ``` ??? 
note "Args Description for Model Stage" - * `model_launcher`: Choose from `xgboost`, `knn_classifier`, `logistic_regression`, `random_forest_classifier`, `sgd_classifier` - * `input_dir`: The directory containing the MEDS data - * `output_dir`: The directory storing tabularized data - * `output_model_dir`: Where to save model outputs - * `hydra.sweeper.n_trials`: Number of trials for hyperparameter optimization - * `hydra.sweeper.n_jobs`: Number of parallel jobs for optimization + - `model_launcher`: Choose from `xgboost`, `knn_classifier`, `logistic_regression`, `random_forest_classifier`, `sgd_classifier` + - `input_dir`: The directory containing the MEDS data + - `output_dir`: The directory storing tabularized data + - `output_model_dir`: Where to save model outputs + - `hydra.sweeper.n_trials`: Number of trials for hyperparameter optimization + - `hydra.sweeper.n_jobs`: Number of parallel jobs for optimization ??? note "Code Inclusion Parameters in Modeling" In this modeling stage, you can change the code inclusion parameters from previous stages and treat them as tunable hyperparameters. Additional task-specific parameters include: - * `min_correlation`: Minimum correlation with target required - * `max_by_correlation`: Maximum number of codes to include based on correlation with target + - `min_correlation`: Minimum correlation with target required + - `max_by_correlation`: Maximum number of codes to include based on correlation with target ??? 
note "Data Preprocessing Options" - * **Tree-based methods** (e.g., XGBoost): - * Insensitive to normalization - * Generally don't benefit from missing value imputation - * XGBoost handles missing data natively - * **Other supported models**: - * Support sparse matrices - * May benefit from normalization or imputation + - **Tree-based methods** (e.g., XGBoost): + - Insensitive to normalization + - Generally don't benefit from missing value imputation + - XGBoost handles missing data natively + - **Other supported models**: + - Support sparse matrices + - May benefit from normalization or imputation Available preprocessing options: - * *Normalization* (maintains sparsity): - * `standard_scaler` - * `max_abs_scaler` - * *Imputation* (converts to dense format): - * `mean_imputer` - * `median_imputer` - * `mode_imputer` + - *Normalization* (maintains sparsity): + - `standard_scaler` + - `max_abs_scaler` + - *Imputation* (converts to dense format): + - `mean_imputer` + - `median_imputer` + - `mode_imputer` ### Input/Output Data Structure + ```text [Previous structure remains the same for input] @@ -478,31 +531,39 @@ OUTPUT_MODEL_DIR/ ``` ### Complete Directory Structure + !!! abstract "Final Directory Structure" ??? folder "MEDS_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "MEDS_RESHARD_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "OUTPUT_DIR" - [Previous structure] + \[Previous structure\] + ??? folder "OUTPUT_MODEL_DIR" ??? folder "TASK/YYYY-MM-DD_HH-MM-SS" ??? folder "best_trial" - * 📄 config.log - * 📄 performance.log - * 📄 xgboost.json + - 📄 config.log + - 📄 performance.log + - 📄 xgboost.json + ??? folder "hydra" - * 📄 optimization_results.yaml + - 📄 optimization_results.yaml + ??? folder "sweep_results" ??? folder "TRIAL_1_ID" - * 📄 config.log - * 📄 performance.log - * 📄 xgboost.json + - 📄 config.log + - 📄 performance.log + - 📄 xgboost.json + ??? 
folder "TRIAL_2_ID" - [Similar structure to TRIAL_1_ID] + \[Similar structure to TRIAL_1_ID\] ??? example "Experimental Feature" We also support an autogluon based hyperparameter and model search: + ```bash meds-tab-autogluon model_launcher=autogluon \ "input_dir=${MEDS_RESHARD_DIR}/data" \ @@ -510,4 +571,5 @@ OUTPUT_MODEL_DIR/ "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" \ "task_name=$TASK" ``` + Run `meds-tab-autogluon model_launcher=autogluon --help` to see all kwargs. Autogluon requires a lot of memory as it makes all the sparse matrices dense, and is not recommended for large datasets.
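The aggregation semantics documented in `docs/terminology.md` above can be sketched in a few lines of Python. This helper and its input layout are illustrative assumptions for one (subject_id, code) pair within a single lookback window — they are not part of the MEDS-Tab API:

```python
# Hypothetical helper illustrating the dynamic aggregations described in
# docs/terminology.md. `values` holds the numeric_value entries observed for
# one (subject_id, code) pair inside one lookback window; None marks a code
# occurrence that carried no numeric value.
def aggregate_window(values):
    numeric = [v for v in values if v is not None]
    return {
        "code/count": len(values),        # every code occurrence counts
        "value/count": len(numeric),      # only occurrences with numeric values
        "value/sum": sum(numeric),
        "value/sum_sqd": sum(v * v for v in numeric),
        "value/min": min(numeric) if numeric else None,
        "value/max": max(numeric) if numeric else None,
    }

# Matches the worked example in the terminology table: Heart Rate [80, NULL, 78, 90]
aggs = aggregate_window([80, None, 78, 90])
print(aggs["code/count"], aggs["value/count"])  # 4 3
```

This mirrors the distinction made in the terminology doc: `code/count` is computed over all dynamic code occurrences, while the `value/*` aggregations only see the subset with numeric values.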