mmcdermott · Oufattole · Jun 5, 2024 · May 25, 2024 · May 26, 2024 · May 26, 2024
diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml
@@ -19,23 +19,21 @@ jobs:
       - name: Checkout
         uses: actions/checkout@v3
 
-      - name: Set up Python 3.11
+      - name: Set up Python 3.12
         uses: actions/setup-python@v3
         with:
-          python-version: "3.11"
+          python-version: "3.12"
 
       - name: Install packages
         run: |
-          pip install -e .
-          pip install pytest
-          pip install pytest-cov[toml]
+          pip install -e .[tests]
 
       #----------------------------------------------
       #              run test suite
       #----------------------------------------------
       - name: Run tests
         run: |
-          pytest -v --doctest-modules --cov
+          pytest -v --doctest-modules --cov --ignore=hf_cohort/
 
       - name: Upload coverage to Codecov
         uses: codecov/[email protected]

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -38,6 +38,7 @@ repos:
     rev: v2.2.0
     hooks:
       - id: autoflake
+        args: [--in-place, --remove-all-unused-imports]
 
   # python upgrading syntax to newer version
   - repo: https://github.com/asottile/pyupgrade

diff --git a/README.md b/README.md
@@ -1,8 +1,10 @@
 # Scalable tabularization and tabular feature usage utilities over generic MEDS datasets
+
 This repository provides utilities and scripts to run limited automatic tabular ML pipelines for generic MEDS
 datasets.
 
 #### Q1: What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
+
 This is a common misconception. _Tabular_ data refers to data that can be organized in a consistent, logical
 set of rows/columns such that the entirety of a "sample" or "instance" for modeling or analysis is contained
 in a single row, and the set of columns possibly observed (there can be missingness) is consistent across all
@@ -15,28 +17,62 @@ or future windows in time to produce a single row per patient with a consistent,
 (though there may still be missingness).
 
 #### Q2: Why not other systems?
-  - [TemporAI](https://github.com/vanderschaarlab/temporai) is the most natural competitor, and already
-    supports AutoML capabilities. However, TemporAI (as of now) does not support generic MEDS datasets, and it
-    is not clear if their AutoML systems will scale to the size of datasets we need to support. But, further
-    investigation is needed, and it may be the case that the best solution here is simply to write a custom
-    data source for MEDS data within TemporAI and leverage their tools.
+
+- [TemporAI](https://github.com/vanderschaarlab/temporai) is the most natural competitor, and already
+  supports AutoML capabilities. However, TemporAI (as of now) does not support generic MEDS datasets, and it
+  is not clear if their AutoML systems will scale to the size of datasets we need to support. But, further
+  investigation is needed, and it may be the case that the best solution here is simply to write a custom
+  data source for MEDS data within TemporAI and leverage their tools.
 
 # Installation
+
 Clone this repository and install the requirements by running `pip install .` in the root directory.
 
 # Usage
+
 This repository consists of two key pieces:
-  1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing
-     patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) either backwards or
-     forwards in time from a given index date. Naturally, only "look-back" windows should be used for
-     future-event prediction tasks; however, the capability to summarize "look-ahead" windows is also useful
-     for characterizing and describing the differences between patient populations statistically.
-  2. Running basic AutoML pipelines over these tabular features to predict arbitrary binary classification
-     downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced --
-     what is more advanced is the efficient construction, storage, and loading of tabular features for the
-     candidate AutoML models, enabling a far more extensive search over different featurization strategies.
+
+1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing
+   patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) either backwards or
+   forwards in time from a given index date. Naturally, only "look-back" windows should be used for
+   future-event prediction tasks; however, the capability to summarize "look-ahead" windows is also useful
+   for characterizing and describing the differences between patient populations statistically.
+2. Running basic AutoML pipelines over these tabular features to predict arbitrary binary classification
+   downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced --
+   what is more advanced is the efficient construction, storage, and loading of tabular features for the
+   candidate AutoML models, enabling a far more extensive search over different featurization strategies.
+
+### Scripts and Examples
+
+See `tests/test_tabularize_integration.py` for an example of the end-to-end pipeline being run on synthetic data. This
+script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
+
+#### Core Scripts:
+
+1. `scripts/identify_columns.py` loads all training shards to identify which feature columns
+   to generate tabular data for.
+2. `scripts/tabularize_static.py` iterates through shards and generates tabular vectors for
+   each patient. There is a single row per patient for each shard.
+3. `scripts/summarize_over_windows.py` iterates, for each shard, through window sizes and aggregations and
+   horizontally concatenates the outputs to generate the final tabular representations at every event time for
+   every patient.
+4. `scripts/tabularize_merge` aligns the time-series window aggregations (generated in the previous step) with
+   the static tabular vectors and caches them for training.
+5. `scripts/hf_cohort/aces_task_extraction.py` generates the task labels and caches them with the event_id
+   indexes that align them with the nearest prior event in the tabular data.
+6. `scripts/xgboost_sweep.py` tunes XGBoost, iterating through the labels and corresponding tabular data.
+
+We run this on an example dataset using the following bash scripts in sequence:
+
+```bash
+bash hf_cohort_shard.sh  # processes the dataset into MEDS format
+bash hf_cohort_e2e.sh  # performs steps 1-4 above
+bash hf_cohort/aces_task.sh  # generates labels (step 5)
+bash xgboost.sh  # trains XGBoost (step 6)
+```
 
 ## Feature Construction, Storage, and Loading
+
 Tabularization of a (raw) MEDS dataset is done by running the `scripts/data/tabularize.py` script. This script
 must inherently do a base level of preprocessing over the MEDS data, then will construct a sharded tabular
 representation that respects the overall sharding of the raw data. This script uses [Hydra](https://hydra.cc/)
@@ -45,14 +81,39 @@ to manage configuration, and the configuration file is located at `configs/tabul
 ## AutoML Pipelines
 
 # TODOs
-  1. Leverage the "event bound aggregation" capabilities of [ESGPT Task
-     Select](https://github.com/justin13601/ESGPTTaskQuerying/) to construct tabular summary features for
-     event-bound historical windows (e.g., until the prior admission, until the last diagnosis of some type,
-     etc.).
-  2. Support more feature aggregation functions.
-  3. Probably rename this repository, as the focus is really more on the tabularization and feature usage
-     utilities than on the AutoML pipelines themselves.
-  4. Import, rather than reimplement, the mapper utilities from the MEDS preprocessing repository.
-  5. Investigate the feasibility of using TemporAI for this task.
-  6. Consider splitting the feature construction and AutoML pipeline parts of this repository into separate
-     repositories.
+
+1. Leverage the "event bound aggregation" capabilities of [ESGPT Task
+   Select](https://github.com/justin13601/ESGPTTaskQuerying/) to construct tabular summary features for
+   event-bound historical windows (e.g., until the prior admission, until the last diagnosis of some type,
+   etc.).
+2. Support more feature aggregation functions.
+3. Probably rename this repository, as the focus is really more on the tabularization and feature usage
+   utilities than on the AutoML pipelines themselves.
+4. Import, rather than reimplement, the mapper utilities from the MEDS preprocessing repository.
+5. Investigate the feasibility of using TemporAI for this task.
+6. Consider splitting the feature construction and AutoML pipeline parts of this repository into separate
+   repositories.
+
+# YAML Configuration File
+
+- `MEDS_cohort_dir`: directory of MEDS format dataset that is ingested.
+- `tabularized_data_dir`: output directory of tabularized data.
+- `min_code_inclusion_frequency`: The base feature inclusion frequency that should be used to dictate
+  what features can be included in the flat representation. It can either be a float, in which
+  case it applies across all measurements, or `None`, in which case no filtering is applied, or
+  a dictionary from measurement type to a float dictating a per-measurement-type inclusion
+  cutoff.
+- `window_sizes`: Beyond writing out a raw, per-event flattened representation, the dataset also has
+  the capability to summarize these flattened representations over the historical windows
+  specified in this argument. These are strings specifying time deltas, using this syntax:
+  `link`\_. Each window size will be summarized to a separate directory, and will share the same
+  subject file split as is used in the raw representation files.
+- `codes`: A list of codes to include in the flat representation. If `None`, all codes will be included
+  in the flat representation.
+- `aggs`: A list of aggregations to apply to the raw representation. Must have length greater than 0.
+- `n_patients_per_sub_shard`: The number of subjects that should be included in each output file.
+  Lowering this number increases the number of files written, making the process of creating and
+  leveraging these files slower but more memory efficient.
+- `do_overwrite`: If `True`, any data already stored in the target save directory will be
+  overwritten.
+- `seed`: The seed to use for random number generation.
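The README text above says `min_code_inclusion_frequency` may be a float, `None`, or a dictionary from measurement type to float. A minimal sketch of how those three forms could be interpreted follows. The helper `select_codes`, the shape of `code_freqs`, and the choice to leave measurement types without a dictionary entry unfiltered are all assumptions made for illustration; none of them is taken from this repository.

```python
from collections.abc import Mapping


def select_codes(code_freqs, cutoff):
    """Hypothetical helper: keep codes whose frequency meets the cutoff.

    code_freqs: {measurement_type: {code: frequency}}
    cutoff: None (no filtering), a float (one cutoff for every measurement
            type), or a mapping {measurement_type: float}.
    """
    kept = {}
    for measurement, freqs in code_freqs.items():
        if cutoff is None:
            threshold = None
        elif isinstance(cutoff, Mapping):
            # Assumption: a type missing from the mapping is left unfiltered.
            threshold = cutoff.get(measurement)
        else:
            threshold = float(cutoff)
        kept[measurement] = sorted(
            code for code, freq in freqs.items()
            if threshold is None or freq >= threshold
        )
    return kept


freqs = {"lab": {"A": 0.5, "B": 0.01}, "dx": {"C": 0.2}}
print(select_codes(freqs, None))
print(select_codes(freqs, 0.1))
print(select_codes(freqs, {"lab": 0.6}))
```

Whether the real cutoff is a fraction or a raw count is not stated in the README, so the sketch only relies on a `>=` comparison.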
diff --git a/configs/tabularize.yaml b/configs/tabularize.yaml
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,27 +1,38 @@
-[build-system]
-requires = ["setuptools>=61.0"]
-build-backend = "setuptools.build_meta"
-
 [project]
 name = "MEDS_tabularization"
 version = "0.0.1"
 authors = [
   { name="Matthew McDermott", email="[email protected]" },
+  { name="Nassim Oufattole", email="[email protected]" },
+  { name="Teya Bergamaschi", email="[email protected]" },
 ]
-description = "TODO"
+description = "Scalable Tabularization of MEDS format Time-Series data"
 readme = "README.md"
 requires-python = ">=3.12"
 classifiers = [
     "Programming Language :: Python :: 3",
     "License :: OSI Approved :: MIT License",
     "Operating System :: OS Independent",
 ]
-dependencies = ["polars", "pyarrow", "loguru", "hydra-core", "numpy"]
+dependencies = ["polars", "pyarrow", "loguru", "hydra-core", "numpy", "scipy", "pandas", "tqdm", "xgboost", "scikit-learn", "hydra-optuna-sweeper", "hydra-joblib-launcher", "ml-mixins"]
+
+[project.scripts]
+meds-tab-describe = "MEDS_tabular_automl.scripts.describe_codes:main"
+meds-tab-tabularize-static = "MEDS_tabular_automl.scripts.tabularize_static:main"
+meds-tab-tabularize-time-series = "MEDS_tabular_automl.scripts.tabularize_time_series:main"
+meds-tab-cache-task = "MEDS_tabular_automl.scripts.cache_task:main"
+meds-tab-xgboost = "MEDS_tabular_automl.scripts.launch_xgboost:main"
+meds-tab-xgboost-sweep = "MEDS_tabular_automl.scripts.sweep_xgboost:main"
 
 [project.optional-dependencies]
 dev = ["pre-commit"]
 tests = ["pytest", "pytest-cov", "rootutils"]
+profiling = ["mprofile", "matplotlib"]
+
+[build-system]
+requires = ["setuptools>=61.0", "setuptools-scm>=8.0", "wheel"]
+build-backend = "setuptools.build_meta"
 
 [project.urls]
-Homepage = "https://github.com/mmcdermott/MEDS_polars_functions"
-Issues = "https://github.com/mmcdermott/MEDS_polars_functions/issues"
+Homepage = "https://github.com/mmcdermott/MEDS_Tabular_AutoML"
+Issues = "https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues"
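Each `[project.scripts]` entry above maps a console command to a `module:function` reference. The sketch below shows roughly how such a reference can be resolved to a callable with the standard library. `load_entry_point` is a hypothetical helper, and `json:dumps` stands in for the package's own targets so the example runs without `MEDS_tabular_automl` installed.

```python
import importlib


def load_entry_point(spec: str):
    """Resolve a 'package.module:attr' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    if attr_path:
        # Walk dotted attributes, e.g. "Class.method".
        for attr in attr_path.split("."):
            obj = getattr(obj, attr)
    return obj


# Stand-in target; a real entry here would look like
# "MEDS_tabular_automl.scripts.describe_codes:main".
dumps = load_entry_point("json:dumps")
print(dumps({"ok": True}))  # prints {"ok": true}
```

After `pip install`, the generated `meds-tab-*` commands call the referenced `main` functions in much the same way.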
diff --git a/src/MEDS_tabular_automl/configs/__init__.py b/src/MEDS_tabular_automl/configs/__init__.py
diff --git a/src/MEDS_tabular_automl/configs/default.yaml b/src/MEDS_tabular_automl/configs/default.yaml
@@ -0,0 +1,17 @@
+MEDS_cohort_dir: ???
+do_overwrite: False
+seed: 1
+tqdm: False
+worker: 0
+loguru_init: False
+
+log_dir: ${output_dir}/.logs/
+
+hydra:
+  verbose: False
+  job:
+    name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S}
+  sweep:
+    dir: ${log_dir}
+  run:
+    dir: ${log_dir}
diff --git a/src/MEDS_tabular_automl/configs/describe_codes.yaml b/src/MEDS_tabular_automl/configs/describe_codes.yaml
@@ -0,0 +1,14 @@
+defaults:
+  - default
+  - _self_
+
+# split we wish to get metadata for
+split: train
+# Raw data, must have a subdirectory "train" with the training data split
+input_dir: ${MEDS_cohort_dir}/final_cohort/${split}
+# Where to store output code frequency data
+cache_dir: ${MEDS_cohort_dir}/.cache
+output_dir: ${MEDS_cohort_dir}
+output_filepath: ${output_dir}/code_metadata.parquet
+
+name: describe_codes
diff --git a/src/MEDS_tabular_automl/configs/launch_xgboost.yaml b/src/MEDS_tabular_automl/configs/launch_xgboost.yaml
@@ -0,0 +1,81 @@
+defaults:
+  - default
+  - tabularization: default
+  - _self_
+
+task_name: task
+# min code frequency used for modeling, can potentially sweep over different values.
+modeling_min_code_freq: 10
+
+# Task cached data dir
+input_dir: ${MEDS_cohort_dir}/${task_name}/task_cache
+# Directory with task labels
+input_label_dir: ${MEDS_cohort_dir}/${task_name}/labels
+# Where to output the model and cached data
+output_dir: ${MEDS_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
+output_filepath: ${output_dir}/model_metadata.parquet
+cache_dir: ${MEDS_cohort_dir}/.cache
+
+# Model parameters
+model_params:
+  num_boost_round: 1000
+  early_stopping_rounds: 5
+  model:
+    booster: gbtree
+    device: cpu
+    nthread: 1
+    tree_method: hist
+    objective: binary:logistic
+  iterator:
+    keep_data_in_memory: True
+    binarize_task: True
+
+# Define search space for Optuna
+optuna:
+  study_name: xgboost_sweep_${now:%Y-%m-%d_%H-%M-%S}
+  storage: null
+  load_if_exists: False
+  direction: minimize
+  sampler: null
+  pruner: null
+
+  n_trials: 10
+  n_jobs: 1
+  show_progress_bar: False
+
+  params:
+    suggest_categorical:
+      window_sizes: ${generate_permutations:${tabularization.window_sizes}}
+      aggs: ${generate_permutations:${tabularization.aggs}}
+    suggest_float:
+      eta:
+        low: .001
+        high: 1
+        log: True
+      lambda:
+        low: .001
+        high: 1
+        log: True
+      alpha:
+        low: .001
+        high: 1
+        log: True
+      subsample:
+        low: 0.5
+        high: 1
+      min_child_weight:
+        low: 1e-2
+        high: 100
+    suggest_int:
+      num_boost_round:
+        low: 10
+        high: 1000
+      max_depth:
+        low: 2
+        high: 16
+      min_code_inclusion_frequency:
+        low: 10
+        high: 1_000_000
+        log: True
+
+name: launch_xgboost
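The `suggest_categorical` entries above pass `tabularization.window_sizes` and `tabularization.aggs` through a custom `generate_permutations` resolver whose definition is not part of this diff. One plausible reading, sketched below as an assumption, is that it expands a list into every non-empty subset so the sweep can choose which windows or aggregations to include. The function name `generate_subsets` and the example window strings are hypothetical.

```python
from itertools import combinations


def generate_subsets(options):
    """Return every non-empty, order-preserving subset of `options` as tuples."""
    subsets = []
    for size in range(1, len(options) + 1):
        subsets.extend(combinations(options, size))
    return subsets


choices = generate_subsets(["1d", "7d", "30d"])
print(len(choices))  # 7, i.e. 2**3 - 1
print(choices)
```

Under this reading the number of categorical choices grows as 2^n - 1, so long `window_sizes` or `aggs` lists make the search space large quickly.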
diff --git a/src/MEDS_tabular_automl/configs/tabularization.yaml b/src/MEDS_tabular_automl/configs/tabularization.yaml
@@ -0,0 +1,12 @@
+defaults:
+  - default
+  - tabularization: default
+  - _self_
+
+# Raw data
+# Where the code metadata is stored
+input_code_metadata_fp: ${MEDS_cohort_dir}/code_metadata.parquet
+input_dir: ${MEDS_cohort_dir}/final_cohort
+output_dir: ${MEDS_cohort_dir}/tabularize
+
+name: tabularization
diff --git a/src/MEDS_tabular_automl/configs/tabularization/__init__.py b/src/MEDS_tabular_automl/configs/tabularization/__init__.py