From 45cdda704f3bc890a1998564bdf672d0c2ce2224 Mon Sep 17 00:00:00 2001 From: Nassim Oufattole Date: Tue, 22 Oct 2024 22:04:52 -0400 Subject: [PATCH] Restructured overview and user guide (previously named implementation). Added file structure dropdowns in the usage guide to allow users to interact with the file structure. --- docs/implementation.md | 241 ------------- docs/javascripts/directory-tree.js | 13 + docs/overview.md | 303 ++--------------- docs/stylesheets/directory-tree.css | 58 ++++ docs/usage_guide.md | 510 ++++++++++++++++++++++++++++ mkdocs.yml | 6 +- 6 files changed, 612 insertions(+), 519 deletions(-) delete mode 100644 docs/implementation.md create mode 100644 docs/javascripts/directory-tree.js create mode 100644 docs/stylesheets/directory-tree.css create mode 100644 docs/usage_guide.md diff --git a/docs/implementation.md b/docs/implementation.md deleted file mode 100644 index e74addf..0000000 --- a/docs/implementation.md +++ /dev/null @@ -1,241 +0,0 @@ -# The MEDS-Tab Architecture - -MEDS-Tab is designed to address two key challenges in healthcare machine learning: (1) efficiently tabularizing large-scale electronic health record (EHR) data and (2) training competitive baseline models on this tabularized data. This document outlines the architecture and implementation details of MEDS-Tab's pipeline. - -## Overview - -The MEDS-Tab pipeline consists of six main stages, with the first (stage 0) being optional: - -0. Data Resharding (Optional) -1. Data Description (Code Frequency Analysis) -2. Static Data Tabularization -3. Time-Series Data Tabularization -4. Task-Specific Data Caching -5. Model Training - -Each stage is designed with scalability and efficiency in mind, using sparse matrix operations and data sharding to handle large-scale medical datasets. - -## Stage 0: Data Resharding (Optional) - -This optional preliminary stage helps optimize data processing by restructuring the input data into manageable shards. Resharding is particularly useful when dealing with large datasets or when experiencing memory constraints. The process uses the MEDS_transform-reshard_to_split command and supports parallel processing via Hydra's joblib launcher, with configurable shard sizes based on number of subjects. - -Consider resharding if you're experiencing memory issues in later stages, need to process very large datasets, want to enable efficient parallel processing, or have uneven distribution of data across existing shards. - -### Output Structure -```text -/PATH/TO/MEDS_RESHARD_DIR -│ -└─── -│ │ .parquet -│ │ .parquet -│ │ ... -│ -└─── - │ .parquet - │ .parquet - │ ... -``` - -## Stage 1: Data Description - -The first stage analyzes the MEDS data to compute code frequencies and categorize features. This information is crucial for subsequent feature selection and optimization. The implementation iterates through data shards to compute feature frequencies and categorizes codes into dynamic codes (codes with timestamps), dynamic numeric values (codes with timestamps and numerical values), static codes (codes without timestamps), and static numeric values (codes without timestamps but with numerical values). Results are stored in a `${output_dir}/metadata/codes.parquet` file for use in subsequent stages, where `output_dir` is a key word argument. - -### Input Data Structure -```text -/PATH/TO/MEDS/DATA -│ -└─── -│ │ .parquet -│ │ .parquet -│ │ ... -│ -└─── - │ .parquet - │ .parquet - │ ... 
-``` - -## Stage 2: Static Data Tabularization - -This stage processes static patient data (data without timestamps) into a format suitable for modeling. The implementation uses a dense pivot operations which because static data is generally relatively small. Then this stage converts the data to a sparse matrix format for consistency with time-series data. At first there is a single row for each `subject_id` with their static data. This is are duplicated by the number of unique times the patient has data to align with time-series events, and processing over shards is performed serially due to the manageable size of static data. - -### Input Data Structure -```text -/PATH/TO/MEDS/DATA -│ -└─── -│ │ .parquet -│ │ .parquet -│ │ ... -│ -└─── - │ .parquet - │ .parquet - │ ... -``` - -### Output Data Structure -```text -${output_dir}/tabularize/ -│ -└─── -│ │ /none/static/present.npz -│ │ /none/static/first.npz -│ │ /none/static/present.npz -│ │ ... -│ -└─── - │ /none/static/present.npz - │ /none/static/first.npz - │ /none/static/present.npz - │ ... -``` - -Note that `.../none/static/present.npz` represents the tabularized data for static features with the aggregation method `static/present`. The `.../none/static/first.npz` represents the tabularized data for static features with the aggregation method `static/first`. - -## Stage 3: Time-Series Data Tabularization - -This stage handles the computationally intensive task of converting temporal medical data into feature vectors. The process employs several key optimizations: sparse matrix operations utilizing scipy.sparse for memory-efficient storage of sparse non-zero elements, data sharding that processes data in patient-based shards and enables parallel processing, and efficient aggregation using Polars for fast rolling window computations. - -The process flow begins by loading shard data into a Polars DataFrame, converting it to sparse matrix format where rows represent events and columns represent features. It then aggregates same-day events per patient, applies rolling window aggregations, and stores results in sparse coordinate format (.npz files). - -### Input Data Structure -```text -/PATH/TO/MEDS/DATA -│ -└─── -│ │ .parquet -│ │ .parquet -│ │ ... -│ -└─── - │ .parquet - │ .parquet - │ ... -``` - -### Output Data Structure -```text -${output_dir}/tabularize/ -│ -└─── -│ │ /1d/code/count.npz -│ │ /1d/value/sum.npz -| | ... -| | /7d/code/count.npz -│ │ /7d/value/sum.npz -│ │ ... -| | /1d/code/count.npz -│ │ /1d/value/sum.npz -│ │ ... -│ -└─── - │ ... -``` - -The output structure consists of a directory for each split, containing subdirectories for each shard. Each shard subdirectory contains subdirectories for each aggregation method and window size, with the final output files stored in sparse coordinate format (.npz). In this example we have shown the output for the `1d` and `7d` window sizes and `code/count` and `value/sum` aggregation methods. - -## Stage 4: Task-Specific Data Caching - -This stage aligns tabularized data with specific prediction tasks, optimizing for efficient model training. The implementation accepts task labels following the MEDS label-schema and matches them with nearest prior feature vectors. It filters tabularized data to include only task-relevant events while maintaining sparse format for efficient storage. Labels must include subject_id, prediction_time, and boolean_value for binary classification. 
- - -### Input Data Structure -```text -${output_dir}/tabularize/ # Output from Stage 2 and 3 -${input_label_dir}/**/*.parquet # All parquet files in the `input_label_dir` are used as labels -``` - - -### Output Data Structure - -Labels are cached in: -```text -$output_label_cache_dir -│ -└─── -│ │ .parquet -│ │ .parquet -│ │ ... -│ -└─── - │ .parquet - │ .parquet - │ ... -``` - -For each shard, the labels are stored in a parquet file with the same name as the shard. The labels are stored in the `output_label_cache_dir` directory which by default is relative to the key word argument `$output_dir`: `output_label_cache_dir = ${output_dir}/${task_name}/labels`. - -Task specific tabularized data is cached in the following format: -```text -$output_tabularized_cache_dir -└─── -│ │ /1d/code/count.npz -│ │ /1d/value/sum.npz -| | /none/static/present.npz -| | /none/static/first.npz -| | ... -| | /7d/code/count.npz -│ │ /7d/value/sum.npz -│ │ ... -| | /1d/code/count.npz -│ │ /1d/value/sum.npz -│ │ /none/static/present.npz -| | /none/static/first.npz -│ │ ... -│ -└─── - │ ... -``` -The output structure is identical to the structure in Stages 2 and 3, but where we filter rows in the sparse matrix to only include events relevant to the task. This is done by selecting one row for each label that corresponds with the nearest prior event. The task-specific tabularized data is stored in the `output_tabularized_cache_dir` directory. By default this directory is relative to the key word argument `$output_dir`: `output_tabularized_cache_dir = ${output_dir}/${task_name}/task_cache`. - -## Stage 5: Model Training - -The final stage provides efficient model training capabilities, particularly optimized for XGBoost. The system incorporates extended memory support through sequential shard loading during training and efficient data loading through custom iterators. AutoML integration uses Optuna for hyperparameter optimization, tuning across model parameters, aggregation methods, window sizes, and feature selection thresholds. - -### Input Data Structure -```text -# Location of task, split, and shard specific tabularized data -${input_tabularized_cache_dir} # Output from Stage 4 -# Location of task, split, and shard specific label data -${input_label_cache_dir} # Output from Stage 4 -``` - -### Output Data Structure - -For single runs, the output structure is as follows: -```text -# Where to output the model and cached data -time_output_model_dir = ${output_model_dir}/${now:%Y-%m-%d_%H-%M-%S} -├── config.log -├── performance.log -└── xgboost.json # model weights -``` - -For `multirun` optuna hyperparameter sweeps we get the following output structure: -```text -# Where to output the model and cached data -time_output_model_dir = ${output_model_dir}/${now:%Y-%m-%d_%H-%M-%S} -├── best_trial -| ├── config.log -| ├── performance.log -| └── xgboost.json # model weights -├── hydra -| └── optimization_results.yaml # contains the optimal trial hyperparameters and performance -└── sweep_results # This folder contains raw results for every hyperparameter trial - └── - ├── config.log # model config log - ├── performance.log # model performance log - └── xgboost.json # model weights - └── - ... -``` - -`output_model_dir` is a keyword argument that specifies the directory where the model and cached data are stored. By default, we append the current date and time to the directory name to avoid overwriting previous runs, and use the `time_output_model_dir` variable to store the full path. 
If you use a different `model_launcher` than XGBoost, the model weights file will be named accordingly for that model (and will be a `.pkl` file instead of a `json`). - -### Supported Models and Processing Options -The default model is XGBoost, with additional options including KNN Classifier, Logistic Regression, Random Forest Classifier, SGD Classifier, and experimental AutoGluon support. Data processing options include sparse-preserving normalization (standard_scaler, max_abs_scaler) and imputation methods that convert to dense format (mean_imputer, median_imputer, mode_imputer). By default no normalization is applied and missing values are treated as missing by `xgboost` or as zero by other models. - -## Additional Considerations - -The architecture emphasizes robust memory management through sparse matrices and efficient data sharding, while supporting parallel processing and handling of high-dimensional feature spaces. The system is optimized for performance, minimizing memory footprint and computational overhead while enabling processing of datasets with hundreds of millions of events and tens of thousands of unique medical codes. diff --git a/docs/javascripts/directory-tree.js b/docs/javascripts/directory-tree.js new file mode 100644 index 0000000..8404e51 --- /dev/null +++ b/docs/javascripts/directory-tree.js @@ -0,0 +1,13 @@ +function toggleFolder(folderId) { + const content = document.getElementById(folderId); + const folderItem = content.previousElementSibling; + + // Toggle active state on folder item + folderItem.classList.toggle('active'); + + // Toggle visibility of content + content.classList.toggle('visible'); + + // Prevent event bubbling + event.stopPropagation(); +} diff --git a/docs/overview.md b/docs/overview.md index 31ad8a7..4e61217 100644 --- a/docs/overview.md +++ b/docs/overview.md @@ -1,299 +1,48 @@ +# The MEDS-Tab Architecture -# Core CLI Scripts Overview -We provide a set of core CLI scripts to facilitate the tabularization and modeling of MEDS data. These scripts are designed to be run in sequence to transform raw MEDS data into tabularized data and train a model on the tabularized data. The following is a high-level overview of the core CLI scripts: +MEDS-Tab addresses two key challenges in healthcare machine learning: efficiently tabularizing large-scale electronic health record (EHR) data and training competitive baseline models on this tabularized data. This document outlines the architecture and implementation details of MEDS-Tab's pipeline. -## 1. **`MEDS_transform-reshard_to_split`**: +MEDS-Tab is designed to scale to hundreds of millions of events and tens of thousands of unique medical codes. Performance optimization is achieved through: -This optional command reshards the data. A core challenge in tabularization is the high memory usage and slow compute time. We shard the data into small shards to reduce the memory usage as we can independently tabularize each shard, and we can reduce cpu time by parallelizing the processing of these shards across workers that are independently processing different shards. 
+* Efficient parallel processing when appropriate +* Strategic use of sparse data structures +* Memory-aware data loading and processing +* Configurable processing parameters for different hardware capabilities -```bash -MEDS_transform-reshard_to_split \ - --multirun \ - worker="range(0,6)" \ - hydra/launcher=joblib \ - input_dir="$MEDS_DIR" \ - cohort_dir="$MEDS_RESHARD_DIR" \ - 'stages=["reshard_to_split"]' \ - stage="reshard_to_split" \ - stage_configs.reshard_to_split.n_subjects_per_shard=2500 -``` +## Overview -??? note "Args Description" +The MEDS-Tab pipeline consists of six main stages, with the first being optional. The pipeline begins with an optional (1) data resharding stage that optimizes processing by restructuring input data into manageable chunks. This is followed by (2) data description, which computes some summary statistics over the features in the dataset. The core processing happens in the (3) static and (4) time-series tabularization stages, which transform the data into a format suitable for tabular machine learning. (5) Task-specific data caching then aligns this data with prediction tasks, and finally, the (6) model training stage provides efficient training capabilities with support for multiple model types and hyperparameter optimization. - - `--multirun`: This is an optional argument to specify that the command should be run in parallel. We use this here to parallelize the resharing of the data. - - `hydra/launcher`: This is an optional argument to specify the launcher. When using multirun you should specify the launcher. We use joblib here which enables parallelization on a single machine. - - `worker`: When using joblib or a hydra slurm launcher, the range of workers must be defined as it specifies the number of parallel workers to spawn. We use 6 workers here. - - `input_dir`: The directory containing the MEDS data. - - `cohort_dir`: The directory to store the resharded data. - - `stages`: The stages to run. We only run the reshard_to_split stage here. MEDS Transform allows for a sequence of stages to be defined an run which is why this is a list. - - `stage`: The specific stage to run. We run the reshard_to_split stage here. It must be one of the stages in the `stages` kwarg list. - - `stage_configs.reshard_to_split.n_subjects_per_shard`: The number of subjects per shard. We use 2500 subjects per shard here. +## Memory Management Via Sparse Data Structures -For the rest of the tutorial we will assume that the data has been reshared into the `MEDS_RESHARD_DIR` directory, but this step is optional, and you could instead use the original data directory, `MEDS_DIR`. If you experience high memory issues in later stages, you should try reducing `stage_configs.reshard_to_split.n_subjects_per_shard` to a smaller number. +Memory management is a central consideration in MEDS-Tab's design. The system employs several key strategies to handle large-scale medical datasets efficiently: +Sparse matrix operations form the foundation of our memory management approach. We utilize scipy.sparse for memory-efficient storage of sparse non-zero elements, which is particularly effective for medical data where most potential features are not present for any given patient at any given time. -## 2. **`meds-tab-describe`**: +Data sharding complements our sparse matrix approach by breaking data into manageable chunks. This enables both memory-efficient processing and parallelization. 
Shards are processed independently, allowing us to handle datasets that would be impossible to process as a single unit. -This command processes MEDS data shards to compute the frequencies of different code types. It differentiates codes into the following categories: +The system implements efficient aggregation using Polars for fast rolling window computations. This optimizes same-day event aggregation and maintains memory efficiency during temporal calculations. +## Improved Computational Speed Via Parallel Processing -* dynamic codes (codes with timestamps) -* dynamic numeric values (codes with timestamps and numerical values) -* static codes (codes without timestamps) -* static numeric values (codes without timestamps but with numerical values). +Our processing strategy differentiates between sequential and parallel operations based on computational needs and data dependencies. The data description and static tabularization stages operate sequentially, as they have manageable computational requirements. In contrast, time-series tabularization, task-specific caching, and model training leverage parallel processing over independent workers (which may be spawned on different cores on a local machine or over a slurm cluster) to handle their more intensive computational demands. - This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument. +Data flow through the pipeline is optimized through caching and sharding. Each stage's output is structured to minimize memory requirements while maintaining accessibility for subsequent stages. The system preserves sparsity wherever possible and uses efficient shard management to increase processing speed and reduce total memory consumption. -```bash -meds-tab-describe \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" -``` -This stage is not parallelized as it runs very quickly. +## Feature Engineering Via Rolling Windows and Aggregation Functions -??? note "Args Description" +MEDS-Tab implements a comprehensive feature engineering approach that handles both static and temporal data. For static features, we capture both presence and first-recorded values (as there should be only one occurrence of a static code). Time-series features are processed through various aggregation methods including counts, sums, minimums, and maximums. These aggregations can be computed over multiple time windows (1 day, 30 days, 365 days, or the full patient history), providing temporal context at different scales. - - `input_dir`: The directory containing the MEDS data. - - `output_dir`: The directory to store the tabularized data. +Our feature engineering framework maintains flexibility while enforcing consistency. All aggregations preserve sparsity where possible, and the system includes configurable thresholds for feature inclusion based on frequency and relevance to the target task. -## 3. **`meds-tab-tabularize-static`**: +## Model Support and Normalization/Imputation Options -Filters and processes the dataset based on the count of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient. +The architecture includes robust support for multiple model types, with XGBoost as the primary implementation. 
Additional supported models include KNN Classifier, Logistic Regression, Random Forest Classifier, and SGD Classifier. An experimental AutoGluon integration provides automated model selection and tuning capabilities. - **Example: Tabularizing static data** with the minimum code count of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]` +Data processing options are designed to maintain efficiency while providing necessary transformations. Normalization options (standard scaler, max abs scaler) preserve sparsity, while imputation methods (mean, median, mode) are available when dense representations are required or beneficial. - ```console - meds-tab-tabularize-static input_dir="path_to_data" \ - tabularization.min_code_inclusion_count=10 \ - tabularization.window_sizes=[1d,30d,365d,full] \ - do_overwrite=False \ - tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]" - ``` +## Additional Design Considerations - - For the exhaustive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24) +**Extensibility and maintainability**: The pipeline's modular design allows for the addition of new feature types, aggregation methods, and model support. Contributions are welcome! -!!! note - - In addition to `min_code_inclusion_count` there are several other parameters that can be set tabularization to restrict the codes that are included in the tabularized data. These are: - * `allowed_codes`: a list of codes to include in the tabularized data - * `min_code_inclusion_count`: The minimum number of times a code must appear in the data to be included in the tabularized data - * `min_code_inclusion_frequency` The minimum normalized frequency (i.e. normalized by dividing the code's count by the total number of observations across all codes in the dataset) required for a code to be included. - * `max_included_codes`: The maximum number of codes to include in the tabularized data - - -```bash -meds-tab-tabularize-static \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" \ -``` - -This stage is not parallelized as it runs very quickly. -??? note "Args Description" - - `input_dir`: The directory containing the MEDS data. - - `output_dir`: The directory to store the tabularized data. - - -## 4. **`meds-tab-tabularize-time-series`**: - -Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `subject_id` x `time`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner). 
- - **Example: Aggregate time-series data** on features across different `window_sizes` - - -```bash -meds-tab-tabularize-time-series \ - --multirun \ - worker="range(0,$N_PARALLEL_WORKERS)" \ - hydra/launcher=joblib \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" \ - tabularization.min_code_inclusion_count=10 \ - tabularization.window_sizes=[1d,30d,365d,full] \ - tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] -``` - -!!! warning - - This stage is the most memory intensive stage! This stage should be parallelized to speed up the processing of the data. If you run out of memory, either reduce the workers or reshard your data with `MEDS_transform-reshard_to_split` setting `stage_configs.reshard_to_split.n_subjects_per_shard` to a smaller number. This stage is also one of the the slowest stages. - -!!! warning - - You must use the same code inclusion parameters (which in this example is just `tabularization.min_code_inclusion_count`) as in the previous stage, `meds-tab-tabularize-static`, to ensure that the same codes are included in the tabularized data. - -??? note "Args Description" - - - `--multirun`: This is an optional argument to specify that the command should be run in parallel. We use this here to parallelize the resharing of the data. - - `hydra/launcher`: This is an optional argument to specify the launcher. When using multirun you should specify the launcher. We use joblib here which enables parallelization on a single machine. - - `worker`: When using joblib or a hydra slurm launcher, the range of workers must be defined as it specifies the number of parallel workers to spawn. We use `$N_PARALLEL_WORKERS` workers here. - - `input_dir`: The directory containing the MEDS data. - - `output_dir`: The directory to store the tabularized data. - - `tabularization.min_code_inclusion_count`: The minimum code inclusion frequency. We use 10 here, so only codes that appear at least 10 times in the data will be included. - - `tabularization.window_sizes`: The window sizes to use. We use `[1d,30d,365d,full]` here. This means we will generate features for the last day, last 30 days, last 365 days, and the full history of the patient. - - `tabularization.aggs`: The aggregation functions to use. We use `[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]` here. This means we will generate features for the presence of a static code, the value of a static code, the count of dynamic codes, the count of dynamic values, the sum of dynamic values, the sum of squared dynamic values, the minimum dynamic value, and the maximum dynamic value. - - - -## 5. **`meds-tab-cache-task`**: - -Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`. 
- - **Example: Align tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES) - - ```console - meds-tab-cache-task input_dir="path_to_data" \ - task_name=$TASK \ - tabularization.min_code_inclusion_count=10 \ - tabularization.window_sizes=[1d,30d,365d,full] \ - tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] - ``` - - -```bash -meds-tab-cache-task \ - --multirun \ - hydra/launcher=joblib \ - worker="range(0,$N_PARALLEL_WORKERS)" \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" \ - "input_label_dir=${TASKS_DIR}/${TASK}/" "task_name=${TASK}" - tabularization.min_code_inclusion_count=10 \ - tabularization.window_sizes=[1d,30d,365d,full] \ - tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] -``` - -!!! warning - - This stage is the slowest stage, but should not be as memory intensive, so make sure to parallelize across as many workers as possible. - -!!! warning - - You must use the same code inclusion parameters (which in this example is just `tabularization.min_code_inclusion_count`) as in the previous stages, `meds-tab-tabularize-static` and `meds-tab-tabularize-time-series`, to ensure that the same codes are included in the tabularized data. - -??? note "Args Description" - - - `--multirun`: This is an optional argument to specify that the command should be run in parallel. We use this here to parallelize the resharing of the data. - - `hydra/launcher`: This is an optional argument to specify the launcher. When using multirun you should specify the launcher. We use joblib here which enables parallelization on a single machine. - - `worker`: When using joblib or a hydra slurm launcher, the range of workers must be defined as it specifies the number of parallel workers to spawn. We use `$N_PARALLEL_WORKERS` workers here. - - `input_dir`: The directory containing the MEDS data. - - `output_dir`: The directory to store the tabularized data. - - `input_label_dir`: The directory containing the labels (following the [meds label-schema](https://github.com/Medical-Event-Data-Standard/meds?tab=readme-ov-file#the-label-schema)) for the task. - - `task_name`: The name of the task to cache the labels for. - - `tabularization.min_code_inclusion_count`: The minimum code inclusion frequency. - - `tabularization.window_sizes`: The window sizes to use. - - `tabularization.aggs`: The aggregation functions to use. - - -## 6. **`meds-tab-model`**: - -Trains a tabular model using user-specified parameters. You can train a single xgboost model with the following command: -```bash -meds-tab-model \ - model_launcher=xgboost \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" \ - "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" "task_name=$TASK" \ - tabularization.min_code_inclusion_count=10 \ - "tabularization.window_sizes=[1d,30d,365d,full]" \ - "tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]" -``` - -??? note "Args Description" - - `model_launcher`: The launcher to use for the model. choose one in `xgboost`, `knn_classifier`, `logistic_regression`, `random_forest_classifier`, `sgd_classifier`. - - `input_dir`: The directory containing the MEDS data. - - `output_dir`: The directory to store the tabularized data. - - `output_model_dir`: The directory to store the model. 
- - `hydra.sweeper.n_trials`: The number of trials to run in the hyperparameter sweep. - - `hydra.sweeper.n_jobs`: The number of parallel jobs to run in the hyperparameter sweep. - - `task_name`: The name of the task to cache the labels for. - - `tabularization.min_code_inclusion_count`: The minimum code inclusion frequency. - - `tabularization.window_sizes`: The window sizes to use. - - `tabularization.aggs`: The aggregation functions to use. - -??? note "Data Preprocessing Options" - - The tool provides several options for data preprocessing, though these may not always be necessary depending on your chosen model: - - - **Tree-based methods** (e.g., XGBoost): - - Insensitive to normalization - - Generally don't benefit from missing value imputation - - XGBoost natively handles learning decisions for missing data - - **Other supported models** (`knn_classifier`, `logistic_regression`, `random_forest_classifier`, `sgd_classifier`): - - Support sparse matrices - - May benefit from normalization or imputation for optimal performance - - **Available preprocessing options:** - - - *Normalization* (maintains sparsity): - - `standard_scaler`: Unit variance scaling - - `max_abs_scaler`: Maximum absolute value scaling - - - *Imputation* (converts to dense format which significantly increases memory usage!!!): - - `mean_imputer`: Mean imputation - - `median_imputer`: Median imputation - - `mode_imputer`: Mode imputation - -You can also run an [optuna](https://optuna.org/) hyperparameter sweep by adding the `--multirun` flag and can control the number of trials with `hydra.sweeper.n_trials` and parallel jobs with `hydra.sweeper.n_jobs`: - -```bash -meds-tab-model \ - --multirun \ - model_launcher=xgboost \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" \ - "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" "task_name=$TASK" \ - "hydra.sweeper.n_trials=1000" "hydra.sweeper.n_jobs=${N_PARALLEL_WORKERS}" \ - tabularization.min_code_inclusion_count=10 \ - tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \ - tabularization.aggs=$(generate-subsets [static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]) -``` - -??? note "Args Description" - - `multirun`: This is a required argument when sweeping and specifies that we are performing a hyperparameter sweep and using optuna. - - `model_launcher`: The launcher to use for the model. choose one in `xgboost`, `knn_classifier`, `logistic_regression`, `random_forest_classifier`, `sgd_classifier`. - - `input_dir`: The directory containing the MEDS data. - - `output_dir`: The directory to store the tabularized data. - - `output_model_dir`: The directory to store the model. - - `hydra.sweeper.n_trials`: The number of trials to run in the hyperparameter sweep. - - `hydra.sweeper.n_jobs`: The number of parallel jobs to run in the hyperparameter sweep. - - `task_name`: The name of the task to cache the labels for. - - `tabularization.min_code_inclusion_count`: The minimum code inclusion frequency. - - `tabularization.window_sizes`: The window sizes to use. - - `tabularization.aggs`: The aggregation functions to use. - -??? note "Why `generate-subsets`?" - **`generate-subsets`**: Generates and prints a sorted list of all non-empty subsets from a comma-separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations. 
- - For example, you can directly call **`generate-subsets`** in the command line: - - ```console - generate-subsets [2,3,4] \ - [2], [2, 3], [2, 3, 4], [2, 4], [3], [3, 4], [4] - ``` - - This could be used in the command line in concert with other calls. For example, the following call: - - ```console - meds-tab-model --multirun tabularization.window_sizes=$(generate-subsets [1d,2d,7d,full]) - ``` - - would resolve to: - - ```console - meds-tab-model --multirun tabularization.window_sizes=[1d],[1d,2d],[1d,2d,7d],[1d,2d,7d,full],[1d,2d,full],[1d,7d],[1d,7d,full],[1d,full],[2d],[2d,7d],[2d,7d,full],[2d,full],[7d],[7d,full],[full] - ``` - - which can then be correctly interpreted by Hydra's multirun logic to sweep over all possible combinations of window sizes, during hyperparameter tuning! - - - -!!! note "Code Inclusion Parameters" - - In this modeling stage, you can change the code inclusion parameters from the previous tabularization and task caching stages, and treat them as a tunable hyperparameter - - In addition to the previously defined code inclusion parameters, there are two others that we allow only in modeling (as they are task specific): - * `min_correlation`: The minimum correlation a code must have with the target to be included in the tabularized data - * `max_by_correlation`: The maximum number of codes to include in the tabularized data based on correlation with the target. Specifically we sort the codes by correlation with the target and include the top `max_by_correlation` codes. - -??? example "Experimental Feature" - - We also support an autogluon based hyperparameter and model search: - ```bash - meds-tab-autogluon model_launcher=autogluon \ - "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_TABULARIZATION_DIR" \ - "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" "task_name=$TASK" \ - ``` - run `meds-tab-autogluon model_launcher=autogluon --help` to see all kwargs. Autogluon requires a lot of memory as it makes all the sparse matrices dense, and is not recommended for large datasets. +**Highly Configurable**: This pipeline is highly configurable via parameters that allow users to adjust processing based on their specific needs and hardware constraints. See the usage guide for more details. 
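+
+**Illustrative sketch**: As a rough, hypothetical illustration of the ideas above (not the package's actual implementation), the snippet below computes per-subject rolling-window aggregates with Polars and converts them to a SciPy sparse matrix. The toy data, column names, and aggregations are made up for this example, and a recent Polars/SciPy version is assumed.
+
+```python
+import polars as pl
+from scipy.sparse import csr_matrix
+
+# Toy event-level data: one row per (subject, time, code) observation.
+events = (
+    pl.DataFrame(
+        {
+            "subject_id": [1, 1, 1, 2],
+            "time": ["2020-01-01", "2020-01-05", "2020-02-20", "2020-01-03"],
+            "code": ["LAB//HR", "LAB//HR", "DX//I10", "LAB//HR"],
+            "numeric_value": [80.0, 90.0, None, 70.0],
+        }
+    )
+    .with_columns(pl.col("time").str.to_date())
+    .sort("subject_id", "time")
+)
+
+# 30-day rolling window per subject: a code count and a value sum up to each event time.
+windowed = events.rolling(index_column="time", period="30d", group_by="subject_id").agg(
+    pl.len().alias("code/count"),
+    pl.col("numeric_value").sum().alias("value/sum"),
+)
+
+# MEDS-Tab stores per-(subject, time) features as sparse matrices, since most
+# features are absent for most rows; here we only show the conversion step.
+features = csr_matrix(windowed.select("code/count", "value/sum").fill_null(0).to_numpy())
+print(windowed)
+print(features.shape, features.nnz)
+```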
diff --git a/docs/stylesheets/directory-tree.css b/docs/stylesheets/directory-tree.css new file mode 100644 index 0000000..f1999e1 --- /dev/null +++ b/docs/stylesheets/directory-tree.css @@ -0,0 +1,58 @@ +:root { + --md-admonition-icon--folder: url('data:image/svg+xml;charset=utf-8,') +} + +.md-typeset .admonition.folder, +.md-typeset details.folder { + border: none; + border-left: 3px solid rgb(40, 72, 214); + box-shadow: none; + background: none; +} + +.md-typeset .folder > .admonition-title, +.md-typeset .folder > summary { + background: none; + padding: 0.2rem 0.5rem; + margin: 0; + font-weight: normal; + display: flex; + align-items: center; + min-height: 24px; /* Ensure consistent height */ + gap: 0.5rem; /* Add space between icon and text */ +} + +.md-typeset .folder > .admonition-title::before, +.md-typeset .folder > summary::before { + background-color: var(--md-default-fg-color--light); + -webkit-mask-image: var(--md-admonition-icon--folder); + mask-image: var(--md-admonition-icon--folder); + position: relative; /* Remove absolute positioning */ + margin: 0; /* Remove default margins */ + top: 0; /* Remove top positioning */ + left: 0; /* Remove left positioning */ + flex-shrink: 0; /* Prevent icon from shrinking */ +} + +.md-typeset .folder > .admonition-title:hover, +.md-typeset .folder > summary:hover { + background: var(--md-default-fg-color--lightest); + border-radius: 4px; +} + +.md-typeset details.folder > div.content { + padding-left: 1.8rem; + margin: 0; +} + +.md-typeset details.folder ul { + margin: 0.2rem 0 0.2rem 0.8rem; + list-style: none; +} + +.md-typeset details.folder ul li { + margin: 0.2rem 0; + display: flex; + align-items: center; + gap: 0.5rem; +} diff --git a/docs/usage_guide.md b/docs/usage_guide.md new file mode 100644 index 0000000..2d2cedc --- /dev/null +++ b/docs/usage_guide.md @@ -0,0 +1,510 @@ +# Core Usage Guide + +We provide a set of core CLI scripts to facilitate the tabularization and modeling of MEDS data. These scripts are designed to be run in sequence to transform raw MEDS data into tabularized data and train a model on the tabularized data. + +## 1. **`MEDS_transform-reshard_to_split`** + +This optional command reshards the data. A core challenge in tabularization is the high memory usage and slow compute time. We shard the data into small shards to reduce the memory usage as we can independently tabularize each shard, and we can reduce cpu time by parallelizing the processing of these shards across workers that are independently processing different shards. + +```bash +MEDS_transform-reshard_to_split \ + --multirun \ + worker="range(0,6)" \ + hydra/launcher=joblib \ + input_dir="$MEDS_DIR" \ + cohort_dir="$MEDS_RESHARD_DIR" \ + 'stages=["reshard_to_split"]' \ + stage="reshard_to_split" \ + stage_configs.reshard_to_split.n_subjects_per_shard=2500 +``` + +??? note "Args Description" + * `--multirun`: This is an optional argument to specify that the command should be run in parallel. We use this here to parallelize the resharing of the data. + * `hydra/launcher`: This is an optional argument to specify the launcher. When using multirun you should specify the launcher. We use joblib here which enables parallelization on a single machine. + * `worker`: When using joblib or a hydra slurm launcher, the range of workers must be defined as it specifies the number of parallel workers to spawn. We use 6 workers here. + * `input_dir`: The directory containing the MEDS data. + * `cohort_dir`: The directory to store the resharded data. 
+ * `stages`: The stages to run. We only run the reshard_to_split stage here. MEDS Transform allows for a sequence of stages to be defined an run which is why this is a list. + * `stage`: The specific stage to run. We run the reshard_to_split stage here. It must be one of the stages in the `stages` kwarg list. + * `stage_configs.reshard_to_split.n_subjects_per_shard`: The number of subjects per shard. We use 2500 subjects per shard here. + +### Input Data Structure +```text +MEDS_DIR/ +│ +└─── +│ │ .parquet +│ │ .parquet +│ │ ... +│ +└─── + │ .parquet + │ .parquet + │ ... +``` + +### Output Data Structure (New Files) +```text +MEDS_RESHARD_DIR/ +│ +└─── +│ │ .parquet +│ │ .parquet +│ │ ... +│ +└─── + │ .parquet + │ .parquet + │ ... +``` + +### Complete Directory Structure +!!! abstract "Stage 0 Directory Structure" + ??? folder "MEDS_DIR" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + ??? folder "SPLIT B" + * 📄 SHARD 0.parquet + ??? folder "MEDS_RESHARD_DIR" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + * 📄 ... + ??? folder "SPLIT B" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + * 📄 ... + +For the rest of the tutorial we will assume that the data has been reshared into the `MEDS_RESHARD_DIR` directory, but this step is optional, and you could instead use the original data directory, `MEDS_DIR`. If you experience high memory issues in later stages, you should try reducing `stage_configs.reshard_to_split.n_subjects_per_shard` to a smaller number. + +## 2. **`meds-tab-describe`** + +This command processes MEDS data shards to compute the frequencies of different code types. It differentiates codes into the following categories: + +* dynamic codes (codes with timestamps) +* dynamic numeric values (codes with timestamps and numerical values) +* static codes (codes without timestamps) +* static numeric values (codes without timestamps but with numerical values) + +This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `OUTPUT_DIR` argument specified as a hydra-style command line argument. + +```bash +meds-tab-describe \ + "input_dir=${MEDS_RESHARD_DIR}/data" "output_dir=$OUTPUT_DIR" +``` + +This stage is not parallelized as it runs very quickly. + +??? note "Args Description" + * `input_dir`: The directory containing the MEDS data. + * `output_dir`: The directory to store the tabularized data. + +### Input Data Structure +```text +MEDS_RESHARD_DIR/ +│ +└─── +│ │ .parquet +│ │ .parquet +│ │ ... +│ +└─── + │ .parquet + │ .parquet + │ ... +``` + +### Output Data Structure (New Files) +```text +OUTPUT_DIR/ +│ +└─── metadata + │ codes.parquet +``` + +### Complete Directory Structure +!!! abstract "Stage 1 Directory Structure" + ??? folder "MEDS_DIR" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + ??? folder "SPLIT B" + * 📄 SHARD 0.parquet + ??? folder "MEDS_RESHARD_DIR" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + * 📄 ... + ??? folder "SPLIT B" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + * 📄 ... + ??? folder "OUTPUT_DIR" + ??? folder "metadata" + * 📄 codes.parquet + +## 3. **`meds-tab-tabularize-static`** + +Filters and processes the dataset based on the count of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient. 
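+
+Conceptually, the duplication of static features across a subject's event times looks like the toy sketch below (hypothetical column names and data; the actual command writes sparse `.npz` matrices rather than a dense frame):
+
+```python
+import polars as pl
+
+# One row of static features per subject (illustrative, made-up column name).
+static = pl.DataFrame({"subject_id": [1, 2], "GENDER//F": [1, 0]})
+
+# The unique (subject_id, time) pairs at which each subject has events.
+event_times = pl.DataFrame(
+    {"subject_id": [1, 1, 2], "time": ["2020-01-01", "2020-01-05", "2020-01-03"]}
+)
+
+# Broadcast each subject's static row onto every one of their event times.
+static_per_event = (
+    event_times.unique().join(static, on="subject_id", how="left").sort("subject_id", "time")
+)
+print(static_per_event)
+```
+
+The actual CLI invocation is: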
+ +```bash +meds-tab-tabularize-static \ + "input_dir=${MEDS_RESHARD_DIR}/data" \ + "output_dir=$OUTPUT_DIR" \ + tabularization.min_code_inclusion_count=10 \ + tabularization.window_sizes=[1d,30d,365d,full] \ + do_overwrite=False \ + tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] +``` + +This stage is not parallelized as it runs very quickly. + +??? note "Args Description" + * `input_dir`: The directory containing the MEDS data. + * `output_dir`: The directory to store the tabularized data. + * `tabularization.min_code_inclusion_count`: The minimum number of times a code must appear. + * `tabularization.window_sizes`: The window sizes to use for aggregations. + * `do_overwrite`: Whether to overwrite existing files. + * `tabularization.aggs`: The aggregation methods to use. + +!!! note "Code Inclusion Parameters" + In addition to `min_code_inclusion_count` there are several other parameters that can be set in tabularization to restrict the codes that are included: + * `allowed_codes`: a list of codes to include in the tabularized data + * `min_code_inclusion_count`: The minimum number of times a code must appear + * `min_code_inclusion_frequency`: The minimum normalized frequency required + * `max_included_codes`: The maximum number of codes to include + +### Input Data Structure +```text +[Previous structure remains the same] +``` + +### Output Data Structure (New Files) +```text +OUTPUT_DIR/ +└─── tabularize/ + └─── + │ │ /none/static/present.npz + │ │ /none/static/first.npz + │ │ /none/static/present.npz + │ │ ... + │ + └─── + │ /none/static/present.npz + │ /none/static/first.npz + │ /none/static/present.npz + │ ... +``` + +### Complete Directory Structure After Static Tabularization +!!! abstract "Stage 3 Directory Structure" + ??? folder "MEDS_DIR" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + ??? folder "SPLIT B" + * 📄 SHARD 0.parquet + ??? folder "MEDS_RESHARD_DIR" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + * 📄 ... + ??? folder "SPLIT B" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + * 📄 ... + ??? folder "OUTPUT_DIR" + ??? folder "metadata" + * 📄 codes.parquet + ??? folder "tabularize" + ??? folder "SPLIT A" + ??? folder "SHARD 0" + ??? folder "none/static" + * 📄 present.npz + * 📄 first.npz + ??? folder "SHARD 1" + ??? folder "none/static" + * 📄 present.npz + * 📄 first.npz + ??? folder "SPLIT B" + [Similar structure to SPLIT A] + +## 4. **`meds-tab-tabularize-time-series`** + +This stage handles the computationally intensive task of converting temporal medical data into feature vectors. The process employs several key optimizations: sparse matrix operations utilizing scipy.sparse for memory-efficient storage, data sharding that enables parallel processing, and efficient aggregation using Polars for fast rolling window computations. + +```bash +meds-tab-tabularize-time-series \ + --multirun \ + worker="range(0,$N_PARALLEL_WORKERS)" \ + hydra/launcher=joblib \ + "input_dir=${MEDS_RESHARD_DIR}/data" \ + "output_dir=$OUTPUT_DIR" \ + tabularization.min_code_inclusion_count=10 \ + tabularization.window_sizes=[1d,30d,365d,full] \ + tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] +``` + +!!! warning "Memory Usage" + This stage is the most memory intensive stage! This stage should be parallelized to speed up the processing of the data. 
If you run out of memory, either reduce the workers or reshard your data with `MEDS_transform-reshard_to_split` setting `stage_configs.reshard_to_split.n_subjects_per_shard` to a smaller number. + +!!! warning "Code Inclusion Parameters" + You must use the same code inclusion parameters (which in this example is just `tabularization.min_code_inclusion_count`) as in the previous stage, `meds-tab-tabularize-static`, to ensure that the same codes are included in the tabularized data. + +### Input Data Structure +```text +[Previous structure remains the same] +``` + +### Output Data Structure (New Files) +```text +OUTPUT_DIR/tabularize/ +│ +└─── +│ │ /1d/code/count.npz +│ │ /1d/value/sum.npz +| | ... +| | /7d/code/count.npz +│ │ /7d/value/sum.npz +│ │ ... +| | /1d/code/count.npz +│ │ /1d/value/sum.npz +│ │ ... +│ +└─── + │ [Similar structure to SPLIT A] +``` + +### Complete Directory Structure +!!! abstract "Stage 4 Directory Structure" + ??? folder "MEDS_DIR" + [Previous structure] + ??? folder "MEDS_RESHARD_DIR" + [Previous structure] + ??? folder "OUTPUT_DIR" + ??? folder "metadata" + * 📄 codes.parquet + ??? folder "tabularize" + ??? folder "SPLIT A" + ??? folder "SHARD 0" + ??? folder "none/static" + * 📄 present.npz + * 📄 first.npz + ??? folder "1d" + ??? folder "code" + * 📄 count.npz + ??? folder "value" + * 📄 sum.npz + ??? folder "7d" + ??? folder "code" + * 📄 count.npz + ??? folder "value" + * 📄 sum.npz + ??? folder "SHARD 1" + [Similar structure to SHARD 0] + ??? folder "SPLIT B" + [Similar structure to SPLIT A] + +## 5. **`meds-tab-cache-task`** + +Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`. + +```bash +meds-tab-cache-task \ + --multirun \ + hydra/launcher=joblib \ + worker="range(0,$N_PARALLEL_WORKERS)" \ + "input_dir=${MEDS_RESHARD_DIR}/data" \ + "output_dir=$OUTPUT_DIR" \ + "input_label_dir=${TASKS_DIR}/${TASK}/" \ + "task_name=${TASK}" \ + tabularization.min_code_inclusion_count=10 \ + tabularization.window_sizes=[1d,30d,365d,full] \ + tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] +``` + +!!! warning "Stage Duration" + This stage is the slowest stage, but should not be as memory intensive, so make sure to parallelize across as many workers as possible. + +!!! warning "Code Inclusion Parameters" + You must use the same code inclusion parameters (which in this example is just `tabularization.min_code_inclusion_count`) as in the previous stages to ensure that the same codes are included in the tabularized data. + +### Input Data Structure +```text +# Previous structure plus: +TASKS_DIR/ +└─── TASK/ + │ *.parquet # All parquet files containing labels +``` + +### Output Data Structure (New Files) +```text +OUTPUT_DIR/ +└─── TASK/ + ├─── labels/ + │ └─── + │ │ .parquet + │ │ .parquet + │ └─── + │ │ .parquet + │ │ .parquet + └─── task_cache/ + [Similar structure to tabularize/ but filtered for task] +``` + +### Complete Directory Structure +!!! abstract "Stage 5 Directory Structure" + ??? folder "MEDS_DIR" + [Previous structure] + ??? folder "MEDS_RESHARD_DIR" + [Previous structure] + ??? folder "OUTPUT_DIR" + ??? folder "metadata" + * 📄 codes.parquet + ??? folder "tabularize" + [Previous structure] + ??? folder "${TASK}" + ??? folder "labels" + ??? folder "SPLIT A" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + ??? 
folder "SPLIT B" + * 📄 SHARD 0.parquet + * 📄 SHARD 1.parquet + ??? folder "task_cache" + ??? folder "SPLIT A" + ??? folder "SHARD 0" + ??? folder "none/static" + * 📄 present.npz + * 📄 first.npz + ??? folder "1d" + ??? folder "code" + * 📄 count.npz + ??? folder "value" + * 📄 sum.npz + ??? folder "7d" + ??? folder "code" + * 📄 count.npz + ??? folder "value" + * 📄 sum.npz + ??? folder "SHARD 1" + [Similar structure to SHARD 0] + ??? folder "SPLIT B" + [Similar structure to SPLIT A] + +## 6. **`meds-tab-model`** + +Trains a tabular model using user-specified parameters. The system incorporates extended memory support through sequential shard loading during training and efficient data loading through custom iterators. + +### Single Model Training +```bash +meds-tab-model \ + model_launcher=xgboost \ + "input_dir=${MEDS_RESHARD_DIR}/data" \ + "output_dir=$OUTPUT_DIR" \ + "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" \ + "task_name=$TASK" \ + tabularization.min_code_inclusion_count=10 \ + tabularization.window_sizes=[1d,30d,365d,full] \ + tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] +``` + +### Hyperparameter Optimization +```bash +meds-tab-model \ + --multirun \ + model_launcher=xgboost \ + "input_dir=${MEDS_RESHARD_DIR}/data" \ + "output_dir=$OUTPUT_DIR" \ + "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" \ + "task_name=$TASK" \ + "hydra.sweeper.n_trials=1000" \ + "hydra.sweeper.n_jobs=${N_PARALLEL_WORKERS}" \ + tabularization.min_code_inclusion_count=10 \ + tabularization.window_sizes=[1d,30d,365d,full] \ + tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max] +``` + +??? note "Args Description for Model Stage" + * `model_launcher`: Choose from `xgboost`, `knn_classifier`, `logistic_regression`, `random_forest_classifier`, `sgd_classifier` + * `input_dir`: The directory containing the MEDS data + * `output_dir`: The directory storing tabularized data + * `output_model_dir`: Where to save model outputs + * `hydra.sweeper.n_trials`: Number of trials for hyperparameter optimization + * `hydra.sweeper.n_jobs`: Number of parallel jobs for optimization + +??? note "Code Inclusion Parameters in Modeling" + In this modeling stage, you can change the code inclusion parameters from previous stages and treat them as tunable hyperparameters. Additional task-specific parameters include: + * `min_correlation`: Minimum correlation with target required + * `max_by_correlation`: Maximum number of codes to include based on correlation with target + +??? 
note "Data Preprocessing Options" + * **Tree-based methods** (e.g., XGBoost): + * Insensitive to normalization + * Generally don't benefit from missing value imputation + * XGBoost handles missing data natively + * **Other supported models**: + * Support sparse matrices + * May benefit from normalization or imputation + + Available preprocessing options: + * *Normalization* (maintains sparsity): + * `standard_scaler` + * `max_abs_scaler` + * *Imputation* (converts to dense format): + * `mean_imputer` + * `median_imputer` + * `mode_imputer` + +### Input/Output Data Structure +```text +[Previous structure remains the same for input] + +# New output structure: +OUTPUT_MODEL_DIR/ +└─── TASK/YYYY-MM-DD_HH-MM-SS/ + ├── best_trial/ + │ ├── config.log + │ ├── performance.log + │ └── xgboost.json + ├── hydra/ + │ └── optimization_results.yaml + └── sweep_results/ + └── TRIAL_*/ + ├── config.log + ├── performance.log + └── xgboost.json +``` + +### Complete Directory Structure +!!! abstract "Final Directory Structure" + ??? folder "MEDS_DIR" + [Previous structure] + ??? folder "MEDS_RESHARD_DIR" + [Previous structure] + ??? folder "OUTPUT_DIR" + [Previous structure] + ??? folder "OUTPUT_MODEL_DIR" + ??? folder "TASK/YYYY-MM-DD_HH-MM-SS" + ??? folder "best_trial" + * 📄 config.log + * 📄 performance.log + * 📄 xgboost.json + ??? folder "hydra" + * 📄 optimization_results.yaml + ??? folder "sweep_results" + ??? folder "TRIAL_1_ID" + * 📄 config.log + * 📄 performance.log + * 📄 xgboost.json + ??? folder "TRIAL_2_ID" + [Similar structure to TRIAL_1_ID] + +??? note "Experimental Feature" + We also support an autogluon based hyperparameter and model search: + ```bash + meds-tab-autogluon model_launcher=autogluon \ + "input_dir=${MEDS_RESHARD_DIR}/data" \ + "output_dir=$OUTPUT_DIR" \ + "output_model_dir=${OUTPUT_MODEL_DIR}/${TASK}/" \ + "task_name=$TASK" + ``` + Run `meds-tab-autogluon model_launcher=autogluon --help` to see all kwargs. Autogluon requires a lot of memory as it makes all the sparse matrices dense, and is not recommended for large datasets. diff --git a/mkdocs.yml b/mkdocs.yml index a613948..8cd2813 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -6,7 +6,7 @@ site_author: Nassim Oufattole nav: - "Home": index.md - "Overview": overview.md - - "Implementation": implementation.md + - "Usage Guide": usage_guide.md - "MIMICIV Tutorial": tutorial.md - "Terminology": terminology.md - "Benchmark Results": prediction.md @@ -64,10 +64,14 @@ markdown_extensions: - pymdownx.superfences - admonition - pymdownx.details + - attr_list extra_javascript: - javascripts/mathjax.js + - javascripts/directory-tree.js - https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js +extra_css: + - stylesheets/directory-tree.css plugins: - search