Docs #36
…ormat is still very off though.
Walkthrough
The latest changes primarily enhance the MEDS-Tab system's documentation, install processes, code structure, and functionalities. New documentation files explain various features, including installation, implementation, prediction, and profiling. The
Changes
Sequence Diagram(s)
No sequence diagrams were generated as the changes are primarily related to documentation, import paths, and configuration updates rather than new features or significant alterations to control flow.
Poem
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #36 +/- ##
==========================================
+ Coverage 91.72% 91.73% +0.01%
==========================================
Files 13 14 +1
Lines 822 823 +1
==========================================
+ Hits 754 755 +1
Misses 68 68
☔ View full report in Codecov by Sentry.
Actionable comments posted: 8
Outside diff range and nitpick comments (3)
src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1)
Line range hint 94-102: Address the loop variable binding issue to prevent potential bugs in asynchronous or concurrent execution.
- for shard_fp, window_size, agg in iter_wrapper(tabularization_tasks):
+ for shard_fp, window_size, agg in iter_wrapper(list(tabularization_tasks)):
Tools
Ruff
15-15: Module level import not at top of file (E402)
16-16: Module level import not at top of file (E402)
18-18: Module level import not at top of file (E402)
19-19: Module level import not at top of file (E402)
20-20: Module level import not at top of file (E402)
21-21: Module level import not at top of file (E402)
22-22: Module level import not at top of file (E402)
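The loop variable binding issue (Ruff B023) comes from Python's late-binding closures: a function defined inside a loop looks up the loop variable when it is called, not when it is defined. The following is a minimal sketch of the pitfall and of one common fix, binding the current value as a default argument. The names are hypothetical and are not taken from the MEDS-Tab scripts. Note that materializing the iterable with `list(...)` changes when the tasks are produced but does not, by itself, change how a closure captures the variable.

```python
# Hypothetical names for illustration; not taken from the MEDS-Tab scripts.
aggs = ["sum", "min", "max"]

# Late binding: each lambda looks up `agg` when it is called, not when it is
# defined, so once the loop has finished every function sees the last value.
late = []
for agg in aggs:
    late.append(lambda: agg)

# Binding the current value as a default argument freezes it per iteration.
bound = []
for agg in aggs:
    bound.append(lambda agg=agg: agg)

late_results = [fn() for fn in late]
bound_results = [fn() for fn in bound]
print(late_results)   # ['max', 'max', 'max']
print(bound_results)  # ['sum', 'min', 'max']
```

If the functions are always called within the same iteration that defines them, the late lookup is harmless, which is why B023 is often a warning about fragility rather than an active bug.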
src/MEDS_tabular_automl/scripts/tabularize_static.py (1)
Line range hint 142-142: Address the loop variable binding issue to prevent potential bugs in asynchronous or concurrent execution.
- for shard_fp, agg in iter_wrapper(tabularization_tasks):
+ for shard_fp, agg in iter_wrapper(list(tabularization_tasks)):
Tools
Ruff
15-15: Module level import not at top of file (E402)
17-22: Module level import not at top of file (E402)
23-23: Module level import not at top of file (E402)
24-24: Module level import not at top of file (E402)
25-25: Module level import not at top of file (E402)
src/MEDS_tabular_automl/scripts/launch_xgboost.py (1)
Line range hint 90-90: Remove unnecessary `True if ... else False` in condition.
- code_mask = [True if idx in codes_set else False for idx in feature_ids]
+ code_mask = [idx in codes_set for idx in feature_ids]
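A quick check that the two forms are equivalent: `idx in codes_set` already evaluates to a `bool`, so the conditional expression adds nothing. The sample values below are hypothetical stand-ins for the script's real `feature_ids` and `codes_set`.

```python
# Hypothetical sample values; in the reviewed script these come from the data.
feature_ids = [0, 1, 2, 3, 4]
codes_set = {1, 3}

# Original form flagged by SIM210.
verbose_mask = [True if idx in codes_set else False for idx in feature_ids]

# Suggested form: the membership test is already a boolean.
concise_mask = [idx in codes_set for idx in feature_ids]

print(verbose_mask == concise_mask)  # True
print(concise_mask)                  # [False, True, False, True, False]
```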
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (3)
docs/assets/main_fig.png is excluded by `!**/*.png`
docs/assets/meds_logo.png is excluded by `!**/*.png`
docs/assets/pivot.png is excluded by `!**/*.png`
Files selected for processing (17)
- .gitignore (2 hunks)
- README.md (3 hunks)
- docs/make.bat (2 hunks)
- docs/requirements.txt (1 hunks)
- docs/source/conf.py (3 hunks)
- docs/source/implementation.md (1 hunks)
- docs/source/index.md (1 hunks)
- docs/source/installation.md (1 hunks)
- docs/source/overview.md (1 hunks)
- docs/source/prediction.md (1 hunks)
- docs/source/profiling.md (1 hunks)
- src/MEDS_tabular_automl/scripts/cache_task.py (1 hunks)
- src/MEDS_tabular_automl/scripts/describe_codes.py (1 hunks)
- src/MEDS_tabular_automl/scripts/launch_xgboost.py (1 hunks)
- src/MEDS_tabular_automl/scripts/tabularize_static.py (1 hunks)
- src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
Files skipped from review due to trivial changes (5)
- .gitignore
- docs/source/prediction.md
- docs/source/profiling.md
- src/MEDS_tabular_automl/scripts/cache_task.py
- src/MEDS_tabular_automl/scripts/describe_codes.py
Additional context used
Markdownlint
docs/source/implementation.md
3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time
docs/source/installation.md
19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
docs/source/overview.md
14-14: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time
33-33: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
41-41: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
53-53: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
68-68: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
79-79: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
README.md
19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
167-167: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
169-169: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
68-68: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
76-76: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
88-88: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
103-103: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
114-114: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
LanguageTool
docs/source/index.md
[style] ~48-~48: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...ithin the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
[style] ~48-~48: Using many exclamation marks might seem excessive (in this case: 8 exclamation marks for a text that’s 2711 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!
docs/source/overview.md
[uncategorized] ~24-~24: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~37-~37: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...``` 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~37-~37: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~49-~49: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~64-~64: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~66-~66: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~77-~77: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~77-~77: You might be missing the article “the” here. (AI_EN_LECTOR_MISSING_DETERMINER_THE)
Context: ...izes and `aggs` can be generated using `generate-permutations` command (See the ...
[uncategorized] ~90-~90: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... ``` 6. `meds-tab-xgboost-sweep`: Conducts an Optuna hyperparameter sweep...
[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~96-~96: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
[uncategorized] ~96-~96: The preposition “on” seems more likely in this position than the preposition “in”. (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
Context: ...rectly call `generate-permutations` in the command line: ```bash genera...
README.md
[uncategorized] ~59-~59: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...``` 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~72-~72: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~84-~84: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~99-~99: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~101-~101: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~112-~112: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~125-~125: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... ``` 6. `meds-tab-xgboost-sweep`: Conducts an Optuna hyperparameter sweep...
[uncategorized] ~129-~129: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~131-~131: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
Ruff
src/MEDS_tabular_automl/scripts/tabularize_time_series.py
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
13-13: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
15-15: Module level import not at top of file (E402)
16-16: Module level import not at top of file (E402)
18-18: Module level import not at top of file (E402)
19-19: Module level import not at top of file (E402)
20-20: Module level import not at top of file (E402)
21-21: Module level import not at top of file (E402)
22-22: Module level import not at top of file (E402)
23-30: Module level import not at top of file (E402)
94-94: Function definition does not bind loop variable `agg` (B023)
101-101: Function definition does not bind loop variable `window_size` (B023)
102-102: Function definition does not bind loop variable `agg` (B023)
src/MEDS_tabular_automl/scripts/tabularize_static.py
13-13: Module level import not at top of file (E402)
15-15: Module level import not at top of file (E402)
17-22: Module level import not at top of file (E402)
23-23: Module level import not at top of file (E402)
24-24: Module level import not at top of file (E402)
25-25: Module level import not at top of file (E402)
26-34: Module level import not at top of file (E402)
142-142: Function definition does not bind loop variable `agg` (B023)
docs/source/conf.py
61-61: Module level import not at top of file (E402)
src/MEDS_tabular_automl/scripts/launch_xgboost.py
90-90: Remove unnecessary `True if ... else False` (SIM210)
Remove unnecessary `True if ... else False`
Additional comments not posted (15)
docs/requirements.txt (1)
4-4: Ensure the new dependencies `sphinx_immaterial` and `esgpt` are compatible with the project's existing environment.
Also applies to: 14-14
docs/make.bat (1)
13-14: Good addition of a conditional check to guide users when no arguments are provided.
src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1)
18-23: Ensure the relative imports are correctly resolving the modules.
Tools
Ruff
18-18: Module level import not at top of file (E402)
19-19: Module level import not at top of file (E402)
20-20: Module level import not at top of file (E402)
21-21: Module level import not at top of file (E402)
22-22: Module level import not at top of file (E402)
src/MEDS_tabular_automl/scripts/tabularize_static.py (1)
17-26: Ensure the relative imports are correctly resolving the modules.
Verification successful
The relative imports in the `src/MEDS_tabular_automl/scripts/tabularize_static.py` file are consistent and correctly resolve the modules as they are used across multiple files in the project without any issues.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the resolution of relative imports after refactoring.
# Test: Search for the usage of these modules in the project to ensure no unresolved imports.
fd --exec rg 'from ..describe_codes import'
fd --exec rg 'from ..file_name import'
fd --exec rg 'from ..generate_static_features import'
fd --exec rg 'from ..mapper import'
fd --exec rg 'from ..utils import'
Length of output: 6598
Tools
Ruff
17-22: Module level import not at top of file (E402)
23-23: Module level import not at top of file (E402)
24-24: Module level import not at top of file (E402)
25-25: Module level import not at top of file (E402)
README.md (3)
49-57: The changes in the "Scripts and Examples" section enhance the clarity and usefulness of the documentation. Good job!
127-127: The "Additional CLI Scripts" section is well-written and provides useful information on the new CLI script.
152-152: The explanation of "tabular pipelines" is insightful and helps clarify common misconceptions about structured EHR data.
docs/source/conf.py (5)
15-18: The update to the project information, including the new version number and additional authors, is accurate and up-to-date.
85-96: The addition of new Sphinx extensions will enhance the documentation capabilities, providing more features such as todo items, code views, and coverage reports.
100-100: The change to the "pydata_sphinx_theme" is a good choice for enhancing the visual appeal and functionality of the documentation.
175-175: The simplification of the source suffix to only include ".rst" and ".md" makes the documentation sources more consistent.
135-135: Adding common build and system files to the exclude patterns is a good practice to keep the documentation build clean.
src/MEDS_tabular_automl/utils.py (1)
81-81: The addition of `replace=True` in the `OmegaConf.register_new_resolver` call is a good practice to ensure the resolver is correctly registered without conflicts.
src/MEDS_tabular_automl/scripts/launch_xgboost.py (2)
16-18: Relative imports improve modularity and maintainability of the codebase.
16-18: Ensure that the relative paths are correctly resolving to the intended modules.
Verification successful
The relative import paths in `launch_xgboost.py` are correctly resolving to the intended modules.
`get_feature_columns` is imported from `describe_codes`.
`get_model_files` and `list_subdir_files` are defined in `file_name.py`.
`get_feature_indices` and `hydra_loguru_init` are defined in `utils.py`.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.
# Test: Search for the function usage. Expect: Only occurrences of the new signature.
ast-grep --lang python --pattern $'from ..describe_codes import get_feature_columns'
Length of output: 189
Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.
# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
ast-grep --lang python --pattern $'def get_model_files($_) { $$$ }' | grep 'file_name.py'
ast-grep --lang python --pattern $'def list_subdir_files($_) { $$$ }' | grep 'file_name.py'
# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
ast-grep --lang python --pattern $'def get_feature_indices($_) { $$$ }' | grep 'utils.py'
ast-grep --lang python --pattern $'def hydra_loguru_init($_) { $$$ }' | grep 'utils.py'
Length of output: 2179
Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.
# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
ast-grep --lang python --pattern 'def get_model_files' | grep 'file_name.py'
ast-grep --lang python --pattern 'def list_subdir_files' | grep 'file_name.py'
# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
ast-grep --lang python --pattern 'def get_feature_indices' | grep 'utils.py'
ast-grep --lang python --pattern 'def hydra_loguru_init' | grep 'utils.py'
Length of output: 1687
Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.
# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
rg 'def get_model_files' | grep 'file_name.py'
rg 'def list_subdir_files' | grep 'file_name.py'
# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
rg 'def get_feature_indices' | grep 'utils.py'
rg 'def hydra_loguru_init' | grep 'utils.py'
Length of output: 577
docs/source/index.md
Outdated
- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
Consider rephrasing for conciseness and clarity.
- This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
+ This approach simplifies the process and ensures high-quality, reproducible results for ...
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
+ By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach simplifies the process and ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
docs/source/implementation.md
Outdated
@@ -0,0 +1,14 @@
# How does MEDS-Tab Work?

#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
Adjust the heading level to increment by one from the previous level for better readability and structure.
-#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
+### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
Committable suggestion
- #### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
+ ### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
Tools
Markdownlint
3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time
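To make the MD001 rule concrete, here is a minimal sketch of a heading-increment check. It is a simplified stand-in, not markdownlint's implementation: it only recognizes ATX (`#`) headings and skips lines inside fenced code blocks. The function name and the sample text are hypothetical.

```python
import re


def heading_increment_violations(markdown_text):
    """Return (line_number, expected_level, actual_level) for headings that skip a level."""
    violations = []
    previous_level = 0
    in_fence = False
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        # Ignore anything inside fenced code blocks, where '#' is not a heading.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6})\s", line)
        if not match:
            continue
        level = len(match.group(1))
        # A heading may go at most one level deeper than the previous heading.
        if previous_level and level > previous_level + 1:
            violations.append((line_number, previous_level + 1, level))
        previous_level = level
    return violations


sample = '# How does MEDS-Tab Work?\n\n#### What do you mean "tabular pipelines"?\n'
print(heading_increment_violations(sample))  # [(3, 2, 4)]
```

Under this reading of the rule, changing the second heading to `###` would still be reported (expected h2, actual h3), whereas `##` would pass, which matches the "Expected: h2" message above.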
docs/source/installation.md
Outdated
**Local Install**

``` |
Specify the language for fenced code blocks to enhance readability and syntax highlighting.
-```
+```bash
Tools
Markdownlint
31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
README.md
Outdated
See `tests/test_integration.py` for an example of the end-to-end pipeline being run on synthetic data. This
script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

For an end to end example over MIMIC-IV, see the [companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV)
For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).

### Core CLI Scripts Overview
## Core CLI Scripts Overview

1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types
Please review the punctuation and grammar in the descriptions of the CLI scripts to ensure clarity and professionalism.
- **`meds-tab-describe`**: This command processes MEDS data shards...
+ **`meds-tab-describe`**: This command processes MEDS data shards...
- **`meds-tab-tabularize-static`**: Filters and processes the dataset based...
+ **`meds-tab-tabularize-static`**: Filters and processes the dataset based...
- **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard...
+ **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard...
- **`meds-tab-xgboost`**: Trains an XGBoost model using user-specific parameters.
+ **`meds-tab-xgboost`**: Trains an XGBoost model using user-specific parameters.
- **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep...
+ **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep...
- **`generate-permutations`**: Generates and prints a sorted list of all permutations...
+ **`generate-permutations`**: Generates and prints a sorted list of all permutations...
Also applies to: 72-72, 84-84, 99-99, 112-112, 125-125, 129-129
Tools
LanguageTool
[uncategorized] ~59-~59: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1.meds-tab-describe
: This command processes MEDS data shards...
### Additional CLI Scripts

1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
Address markdown linting issues by specifying a language for fenced code blocks.
- ```bash
+ ```bash
Tools
LanguageTool
[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1.generate-permutations
: Generates and prints a sorted list of a...
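For readers unfamiliar with the script, the idea of sweeping over combinations of window sizes or aggregations can be sketched as below. This is an assumption-laden illustration, not the `generate-permutations` implementation: it assumes "permutations" here means every non-empty subset of the comma-separated inputs, and the function name, input parsing, and output type are hypothetical.

```python
from itertools import combinations


def all_nonempty_subsets(comma_separated):
    """Return a sorted list of every non-empty subset of the comma-separated items."""
    # Accept inputs with or without surrounding brackets, e.g. "[2,3,4]" or "2,3,4".
    items = [item.strip() for item in comma_separated.strip("[]").split(",") if item.strip()]
    subsets = [
        list(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
    return sorted(subsets)


subsets = all_nonempty_subsets("[2,3,4]")
print(len(subsets))  # 7
print(subsets[0])    # ['2']
```

The number of subsets grows as 2^n - 1, so a sweep over many window sizes or aggregations becomes large quickly.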
docs/source/overview.md
Outdated
4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

**Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
Correct the grammar to improve readability.
- Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
Committable suggestion was skipped due to low confidence.
Tools
LanguageTool
[grammar] ~66-~66: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task$TASK
and labels that has pulled from [ACES](https://github.com/j...
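The phrase "aligns task-specific labels with the nearest prior event" describes an as-of (backward) join. A minimal sketch for a single patient follows, using only the standard library. It assumes that an event at exactly the label's timestamp counts as "prior", which the documentation excerpt does not specify; the function name and the integer timestamps are hypothetical.

```python
from bisect import bisect_right


def align_labels_to_prior_events(event_times, label_times):
    """For each label time, return the index of the latest event at or before it.

    `event_times` must be sorted ascending. Labels with no earlier event map to None.
    """
    aligned = []
    for label_time in label_times:
        # bisect_right gives the count of events with time <= label_time.
        position = bisect_right(event_times, label_time)
        aligned.append(position - 1 if position > 0 else None)
    return aligned


event_times = [1, 5, 9]       # hypothetical event timestamps for one patient
label_times = [0, 5, 7, 12]   # hypothetical label timestamps for the same patient
aligned = align_labels_to_prior_events(event_times, label_times)
print(aligned)  # [None, 1, 1, 2]
```

On a real dataset the same lookup would be done per `patient_id`, typically with a dataframe library's as-of join rather than a Python loop.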
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- docs/requirements.txt (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- docs/requirements.txt
Actionable comments posted: 0
Actionable comments posted: 8
Outside diff range and nitpick comments (1)
README.md (1)
Line range hint 137-137: Add a comma after "for example" for grammatical correctness.
- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:
Tools
LanguageTool
[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
Markdownlint
47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- .readthedocs.yaml (1 hunks)
- README.md (6 hunks)
- docs/source/index.md (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- .readthedocs.yaml
Additional context used
LanguageTool
docs/source/index.md
[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!
README.md
[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~107-~107: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~135-~135: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...
[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...
[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...
[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...
[style] ~294-~294: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...
[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote `catabra-mem`. Other li...
[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...
[style] ~334-~334: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...
[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...
[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...
Markdownlint
README.md
19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
185-185: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
309-309: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
400-400: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
520-520: null (MD038, no-space-in-code)
Spaces inside code span elements
31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
215-215: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
276-276: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
414-414: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
484-484: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
Additional comments not posted (4)
docs/source/index.md (3)
3-3: Ensure the introduction is engaging and clearly states the purpose of MEDS-Tab.
5-15: The table of contents is well-structured and provides a clear path for navigating the documentation.
19-21: This section effectively highlights the benefits of using MEDS-Tab.
README.md (1)
43-45: Ensure consistency in documentation. The installation instructions are duplicated here, which might confuse users. Consider removing this section or integrating it with the primary installation instructions at the beginning of the document.
docs/source/index.md
Outdated
- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
The bullet points here succinctly communicate the advantages of MEDS-Tab. Consider adding specific metrics or examples to further bolster these claims.
- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.
README.md
Outdated
1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.
Rephrase the description for clarity and conciseness.
- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
README.md
Outdated
@@ -98,9 +106,9 @@
4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

**Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
**Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
Correct grammatical error in the description.
- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+ **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).
README.md
Outdated
2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
**Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
Clarify and simplify the example description.
- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
Actionable comments posted: 14
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- docs/source/implementation.md (1 hunks)
- docs/source/overview.md (1 hunks)
- docs/source/prediction.md (1 hunks)
- docs/source/profiling.md (1 hunks)
Additional context used
LanguageTool
docs/source/implementation.md
[typographical] ~3-~3: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...
[uncategorized] ~9-~9: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...
[style] ~66-~66: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...
[uncategorized] ~74-~74: A determiner appears to be missing. Consider inserting it. (AI_EN_LECTOR_MISSING_DETERMINER)
Context: ...ow sizes. 5. Output Storage: - Sparse array is converted to Coordinate List f...
[style] ~95-~95: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...
docs/source/profiling.md
[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote `catabra-mem`. Other li...
[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...
[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...
[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...
[uncategorized] ~26-~26: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... that `catabra-mem` is omitted from the tables as it never completed within the 10 min...
[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
docs/source/overview.md
[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~85-~85: You might be missing the article “the” here. (AI_EN_LECTOR_MISSING_DETERMINER_THE)
Context: ...izes` and `aggs` can be generated using `generate-permutations` command (See the ...
[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
[uncategorized] ~102-~102: The preposition “on” seems more likely in this position than the preposition “in”. (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
Context: ...rectly call `generate-permutations` in the command line: ```bash genera...
[uncategorized] ~125-~125: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...
[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...
docs/source/prediction.md
[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...
Markdownlint
docs/source/implementation.md
16-16: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
77-77: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
docs/source/profiling.md
90-90: null (MD047, single-trailing-newline)
Files should end with a single newline character
docs/source/overview.md
150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
docs/source/prediction.md
121-121: null (MD038, no-space-in-code)
Spaces inside code span elements
15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
85-85: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
docs/source/implementation.md
Outdated
|
||
This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients: | ||
|
||
``` |
Specify language for fenced code blocks to adhere to Markdown best practices.
- ```
+ ```plaintext
Also applies to: 77-77
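Beyond the two fences flagged here, a quick local check can confirm that no other unlabeled fence remains. The following is a minimal sketch, not Markdownlint's actual MD040 implementation: the helper name `unlabeled_fences` is hypothetical, and it assumes plain three-backtick fences with no nesting and no tilde fences.

```python
FENCE = "`" * 3  # built this way so the example never contains a literal fence


def unlabeled_fences(markdown_text):
    """Return 1-based line numbers of opening fences that lack a language tag."""
    missing = []
    inside_block = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside_block:
            # This fence closes the block that is currently open.
            inside_block = False
        else:
            # This fence opens a new block; a bare fence has no language tag.
            inside_block = True
            if stripped == FENCE:
                missing.append(number)
    return missing


# Hypothetical sample: one unlabeled block followed by one labeled block.
sample = "\n".join(["Intro", FENCE, "ls", FENCE, "", FENCE + "bash", "ls", FENCE])
print(unlabeled_fences(sample))
```

Running a check like this over `docs/source/*.md` before pushing would catch the same class of issue without waiting for the review bot.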
docs/source/implementation.md
Outdated
1. Describe codes (compute feature frequencies)
2. Given time series data tabularize it
3. cache task specific rows of data for efficient loading
Hyphenate "task-specific" for grammatical correctness.
- 3. cache task specific rows of data for efficient loading
+ 3. cache task-specific rows of data for efficient loading
5. **Output Storage**:

- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
Insert "a" before "Sparse array" to correct the determiner omission.
- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
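For readers unfamiliar with the Coordinate List (COO) layout mentioned in this line, here is a minimal pure-Python sketch of the idea. It is an illustration only and not MEDS-Tab's implementation: the helpers `to_coo` and `from_coo` are hypothetical, and a real pipeline would more likely rely on a sparse-matrix library and write the resulting arrays to an `.npz` file.

```python
def to_coo(dense):
    """Convert a dense 2-D list into COO triplets: (rows, cols, values)."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:  # only non-zero entries are stored
                rows.append(i)
                cols.append(j)
                values.append(value)
    return rows, cols, values


def from_coo(rows, cols, values, shape):
    """Rebuild the dense 2-D list from COO triplets and the original shape."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in zip(rows, cols, values):
        dense[i][j] = value
    return dense


# Hypothetical 3x3 feature matrix with two non-zero entries.
dense = [[0, 0, 3], [0, 0, 0], [7, 0, 0]]
rows, cols, values = to_coo(dense)
print(rows, cols, values)
```

Only the non-zero entries and their coordinates are kept, which is why this layout suits tabularized data where most code/aggregation columns are empty for any given row.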
docs/source/profiling.md
Outdated
## MEDS-Tab Tabularization Technique

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.
Consider adding a comma after "method" for better readability.
- ...more memory efficient version of their method which we denote `catabra-mem`.
+ ...more memory efficient version of their method, which we denote `catabra-mem`.
docs/source/profiling.md
Outdated
| --------- | ----------- | ----------- | ----------- | -------- |
| 0m15.867s | 1,410.79 MB | 3,539.32 MB | 442 MB | meds_tab |

______________________________________________________________________
Ensure the file ends with a single newline character.
+ \n
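If it helps to apply this fix across several files at once, the normalization is a one-liner. This is a minimal sketch under stated assumptions: the helper name `ensure_single_trailing_newline` is hypothetical, and it assumes Unix-style `\n` line endings.

```python
def ensure_single_trailing_newline(text):
    """Return text ending in exactly one newline (empty input is left as-is)."""
    if not text:
        return text
    # Drop every trailing newline, then add exactly one back.
    return text.rstrip("\n") + "\n"


print(repr(ensure_single_trailing_newline("last line")))
print(repr(ensure_single_trailing_newline("last line\n\n\n")))
```

Both calls yield the same string, so the function is safe to run repeatedly on files that are already correct.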
docs/source/overview.md
Outdated
3. We need to ensure full and seamless compatibility with the ACES CLI tool, rather than relying on the python API and manual adjustments:
[#34](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/34)

# How does MEDS-Tab Work?
Avoid multiple top-level headings in the same document to adhere to Markdown best practices.
- # How does MEDS-Tab Work?
+ ## How does MEDS-Tab Work?
docs/source/prediction.md
Outdated
|
||
A single XGBoost run was completed to profile time and memory usage. This was done for each `$TASK` using the following command: | ||
|
||
``` |
Specify language for fenced code blocks to adhere to Markdown best practices.
- ```
+ ```bash
Also applies to: 85-85
docs/source/prediction.md
Outdated
| LOS in Hospital > 3 days | Admission + 24 hr | 6m4.884s | 7m5.025s | 1m4.335s | 11011.710 | 12223.449 |
| LOS in Hospital > 3 days | Admission + 48 hr | 6m9.587s | 7m12.853s | 1m3.858s | 10703.064 | 11829.742 |

#### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
Hyphenate "Task-Specific" for grammatical correctness.
- #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- docs/source/prediction.md (1 hunks)
Files not reviewed due to errors (1)
- docs/source/prediction.md (no review received)
Additional context used
LanguageTool
docs/source/prediction.md
[uncategorized] ~58-~58: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...
[uncategorized] ~186-~186: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 2.1.3 eICU Task Specific Training Cohort Size | Task ...
Markdownlint
docs/source/prediction.md
2-2: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
4-4: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
42-42: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
110-110: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
113-113: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
132-132: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
152-152: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces
112-112: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
113-113: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
114-114: Expected: 1; Actual: 4 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
134-134: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
185-185: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
198-198: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
199-199: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines
76-76: null (MD019, no-multiple-space-atx)
Multiple spaces after hash on atx style heading
43-43: Expected: 1; Actual: 0; Above (MD022, blanks-around-headings)
Headings should be surrounded by blank lines
43-43: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines
135-135: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines
152-152: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines
153-153: Expected: 1; Actual: 0; Above (MD022, blanks-around-headings)
Headings should be surrounded by blank lines
186-186: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines
15-15: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
21-21: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
23-23: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
41-41: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
79-79: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
87-87: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
89-89: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
96-96: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
98-98: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
109-109: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines
132-132: null (MD032, blanks-around-lists)
Lists should be surrounded by blank lines
110-110: null (MD038, no-space-in-code)
Spaces inside code span elements
15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
79-79: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- docs/source/index.md (1 hunks)
- docs/source/overview.md (1 hunks)
- docs/source/prediction.md (1 hunks)
- docs/source/profiling.md (1 hunks)
Files not reviewed due to errors (1)
- docs/source/prediction.md (no review received)
Additional context used
LanguageTool
docs/source/index.md
[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!
docs/source/profiling.md
[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d that on the MIMIC-IV and eICU medical datasets we significantly outperform past method...
[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...
[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...
[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
docs/source/overview.md
[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
[uncategorized] ~125-~125: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...
[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...
docs/source/prediction.md
[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...
[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 2.1.3 eICU Task Specific Training Cohort Size | Task ...
Markdownlint
docs/source/overview.md
12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
docs/source/prediction.md
121-121: null (MD038, no-space-in-code)
Spaces inside code span elements
15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
85-85: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
Additional comments not posted (2)
docs/source/index.md (1)
25-27: The bullet points here succinctly communicate the advantages of MEDS-Tab. Consider adding specific metrics or examples to further bolster these claims.
- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.
Likely invalid or redundant comment.
docs/source/overview.md (1)
74-74: Correct the grammar to improve readability.
- Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
Likely invalid or redundant comment.
Tools
LanguageTool
[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
docs/source/index.md
Outdated
MEDS-Tab is a library designed for automated tabularization, data preparation with aggregation, and windowing. Check out below for an overview of MEDS-Tab and how it could be useful in your workflows! | ||
|
||
```{toctree} | ||
--- | ||
glob: | ||
maxdepth: 2 | ||
--- | ||
Overview <overview> | ||
Pipeline/Architecture Details <implementation> | ||
Computational Profile <profiling> | ||
Prediction Performance <prediction> | ||
Module API Reference <api/modules> | ||
``` | ||
|
||
______________________________________________________________________ | ||
|
||
## Why MEDS-Tab? | ||
|
||
MEDS-Tab is a comprehensive framework designed to streamline the handling, modeling, and analysis of complex medical time-series data. By leveraging automated processes, MEDS-Tab significantly reduces the computation required to generate high-quality baseline models for diverse supervised learning tasks. | ||
|
||
### I. Transform to MEDS | ||
|
||
- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions | ||
- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks. | ||
|
||
### II. Run MEDS-Tab | ||
|
||
- Run the MEDS-Tab Command-Line Interface tool (`MEDS-Tab-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://meds-tab--36.org.readthedocs.build/en/36/overview.html#core-cli-scripts-overview)! | ||
|
||
- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space! | ||
|
||
By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations! |
Consider rephrasing for conciseness and clarity.
- This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
+ This approach simplifies the process and ensures high-quality, reproducible results for ...
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the MEDS-Tab ecosystem. This approach simplifies the process and ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations! |
Tools
LanguageTool
[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!
docs/source/profiling.md
Outdated
|
||
## 2. Comparative Performance Analysis | ||
|
||
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. |
Replace "very" with a more precise adjective.
- ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly.
+ ...use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients), and should be processed quickly.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. |
Tools
LanguageTool
[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...
[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...
[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
docs/source/profiling.md
Outdated
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMIC-IV and eICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below. | ||
|
||
## 2. Comparative Performance Analysis | ||
|
||
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. |
Remove "of" after "all" for conciseness.
- ...ing the better performance of MEDS-Tab in all of the scenarios.
+ ...ing the better performance of MEDS-Tab in all the scenarios.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMIC-IV and eICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below. | |
## 2. Comparative Performance Analysis | |
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. |
Tools
LanguageTool
[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d that on the MIMIC-IV and eICU medical datasets we significantly outperform past method...
[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...
[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...
[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- README.md (3 hunks)
- docs/source/prediction.md (1 hunks)
Files not reviewed due to errors (1)
- docs/source/prediction.md (no review received)
Additional context used
LanguageTool
docs/source/prediction.md
[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... 11829.742 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...
[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...
README.md
[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~107-~107: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~135-~135: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
[uncategorized] ~160-~160: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...
[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...
[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...
[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...
[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...
[style] ~294-~294: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...
[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...
[style] ~334-~334: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...
[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...
[uncategorized] ~334-~334: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... that `catabra-mem` is omitted from the tables as it never completed within the 10 min...
[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...
Markdownlint
docs/source/prediction.md
121-121: null (MD038, no-space-in-code)
Spaces inside code span elements
README.md
19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
185-185: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
309-309: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
400-400: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
520-520: null (MD038, no-space-in-code)
Spaces inside code span elements
215-215: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
276-276: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
414-414: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
484-484: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
Additional comments not posted (9)
README.md (9)
25-25: Installation command is clear and concise. This line provides a straightforward command for installing the package via pip, which is crucial for user accessibility.
31-31: Local installation instructions are clear. The instructions for cloning the repository and installing the package locally are clear, promoting easy setup for developers.
76-76
: Explanation of caching feature names and frequencies is clear.The documentation clearly explains how the
meds-tab-describe
script computes and caches feature names and frequencies, which is essential for understanding the data processing workflow.
40-42
: Clarify the description of tabular feature construction and usage.The description of constructing and using tabular features can be enhanced for better clarity and precision.
- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date. + 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
80-80
: Example command for static data tabularization needs clarification.The example command provided for static data tabularization is detailed but could be simplified for better readability.
- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]` + **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
109-109
: Grammar correction needed in example description.The description has a grammatical error that needs correction for clarity.
- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES) + **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).Tools
LanguageTool
[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
133-133: Clarify the functionality of the `generate-permutations` command.
The description of the `generate-permutations` command can be improved for better understanding.
- 1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
+ 1. **`generate-permutations`**: Generates and prints all permutations from a comma-separated input, aiding in sweeping over combinations of window sizes and aggregations.
170-170: Simplify language for clarity.
The language used in the description can be simplified for better readability.
- We should likely decorrelate the default aggregations and/or window sizes we use prior to passing them into the models as features.
+ We should likely decorrelate the default aggregations and window sizes before using them as features in the models.
Tools
LanguageTool
[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...
202-202: Insert a comma for correct punctuation.
A comma is missing in the sentence, which is necessary for correct punctuation and readability.
- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
Tools
LanguageTool
[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- docs/source/installation.md (1 hunks)
- docs/source/overview.md (1 hunks)
Additional context used
Markdownlint
docs/source/installation.md
19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
docs/source/overview.md
12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
LanguageTool
docs/source/overview.md
[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...
[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...
[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...
[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...
[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...
[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...
[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...
Additional comments not posted (1)
docs/source/overview.md (1)
102-102: Add a comma after introductory phrase for clarity.
The phrase "For example" should be followed by a comma to improve readability.
- For example you can directly call **`generate-permutations`**
+ For example, you can directly call **`generate-permutations`**
Likely invalid or redundant comment.
Tools
LanguageTool
[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...
This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.
Clarify conjunction usage in sentence.
The use of "thus" in the description of `meds-tab-tabularize-static` might confuse readers as it suggests a cause-effect relationship not clearly stated.
- thus rows are duplicated across multiple timestamps for the same patient.
+ Consequently, rows are duplicated across multiple timestamps for the same patient.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient. | |
2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, consequently, rows are duplicated across multiple timestamps for the same patient. |
Tools
LanguageTool
[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...
[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...
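The row layout described for `meds-tab-tabularize-static` in this comment (one row per unique `patient_id` and `timestamp`, with a patient's static values repeated on each of their rows) can be sketched as follows. This is a minimal stdlib illustration, not the MEDS-Tab implementation: the function name `tabularize_static`, the input shapes, and the feature name `is_female` are all hypothetical.

```python
def tabularize_static(static_features, event_times):
    """Build one row per unique (patient_id, timestamp) pair.

    static_features: {patient_id: {feature_name: value}}  (assumed shape)
    event_times:     iterable of (patient_id, timestamp)  (assumed shape)
    """
    rows = []
    # De-duplicate pairs so each (patient_id, timestamp) appears exactly once.
    for patient_id, timestamp in sorted(set(event_times)):
        row = {"patient_id": patient_id, "timestamp": timestamp}
        # Static values do not depend on time, so the same values are
        # copied onto every row belonging to this patient.
        row.update(static_features.get(patient_id, {}))
        rows.append(row)
    return rows


if __name__ == "__main__":
    static = {1: {"is_female": 1}, 2: {"is_female": 0}}
    events = [(1, "2020-01-01"), (1, "2020-01-02"), (2, "2020-03-05")]
    for row in tabularize_static(static, events):
        print(row)
```

Patient 1 appears on two rows with identical static values, which is the duplication the sentence under review is trying to describe.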
Actionable comments posted: 6
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- docs/source/installation.md (1 hunks)
- docs/source/overview.md (1 hunks)
docs/source/installation.md
Outdated
# Scalable tabularization and tabular feature usage utilities over generic MEDS datasets | ||
|
||
<p align="center"> | ||
<a href="https://www.python.org/downloads/release/python-3100/"><img alt="Python" src="https://img.shields.io/badge/-Python_3.12+-blue?logo=python&logoColor=white"></a> | ||
<a href="https://pypi.org/project/meds-tab/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.2.5-orange?logoColor=orange"></a> | ||
<a href="https://hydra.cc/"><img alt="Hydra" src="https://img.shields.io/badge/Config-Hydra_1.3-89b8cd"></a> | ||
<a href="https://codecov.io/gh/mmcdermott/MEDS_Tabular_AutoML"><img alt="Codecov" src="https://codecov.io/gh/mmcdermott/MEDS_Tabular_AutoML/graph/badge.svg?token=6GD05EDQ39"></a> | ||
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/tests.yaml"><img alt="Tests" src="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/tests.yaml/badge.svg"></a> | ||
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/code-quality-main.yaml"><img alt="Code Quality" src="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/code-quality-main.yaml/badge.svg"></a> | ||
<a href='https://meds-tab.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/meds-tab/badge/?version=latest' alt='Documentation Status' /></a> | ||
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/graphs/contributors"><img alt="Contributors" src="https://img.shields.io/github/contributors/mmcdermott/MEDS_Tabular_AutoML.svg"></a> | ||
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/pulls"><img alt="Pull Requests" src="https://img.shields.io/badge/PRs-welcome-brightgreen.svg"></a> | ||
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML#license"><img alt="License" src="https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray"></a> |
Improve badge alignment and presentation.
Consider using a table or a more structured layout for the badges to enhance readability and visual appeal.
# Usage
Clarify the repository's purpose in the introduction.
The introduction could be expanded to provide more details about the specific capabilities and advantages of the repository.
## Quick Start

To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

```console
pip install meds-tab
```

**Local Install**

```console
# clone the git repo
pip install .
```

## Scripts and Examples

For an end to end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).

See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
Enhance the Quick Start and Scripts sections with more detailed examples.
Providing more detailed examples and explanations can help new users better understand how to use the repository effectively.
Tools
Markdownlint
12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**
Convert emphasized text to headings for clarity and consistency.
Using headings instead of bold text for section titles can improve the structure and readability of the document.
-**Pip Install**
+### Pip Install
-**Local Install**
+### Local Install
Also applies to: 18-18
Tools
Markdownlint
12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading
Actionable comments posted: 18
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- README.md (3 hunks)
- docs/source/implementation.md (1 hunks)
- docs/source/installation.md (1 hunks)
- docs/source/profiling.md (1 hunks)
- pyproject.toml (1 hunks)
Files skipped from review due to trivial changes (1)
- pyproject.toml
Additional context used
Markdownlint
docs/source/installation.md
19-19: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
23-23: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
29-29: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
README.md
23-23: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
29-29: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
185-185: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
47-47: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
53-53: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
309-309: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
400-400: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
520-520: null
Spaces inside code span elements(MD038, no-space-in-code)
215-215: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
276-276: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
414-414: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
484-484: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
LanguageTool
docs/source/implementation.md
[uncategorized] ~14-~14: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~64-~64: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...(A_LOT_OF)
[style] ~93-~93: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
docs/source/profiling.md
[uncategorized] ~20-~20: Possible missing comma found.
Context: ...w that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
README.md
[uncategorized] ~69-~69: Loose punctuation mark.
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~78-~78: Loose punctuation mark.
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...(THUS_SENTENCE)
[uncategorized] ~92-~92: Loose punctuation mark.
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~107-~107: Loose punctuation mark.
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~109-~109: Possible subject-verb agreement error detected.
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...(PLURAL_THAT_AGREEMENT)
[uncategorized] ~120-~120: Loose punctuation mark.
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~135-~135: Loose punctuation mark.
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...(COMMA_FOR_EXAMPLE)
[uncategorized] ~160-~160: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...(EN_WORDINESS_PREMIUM_PRIOR_TO)
[typographical] ~202-~202: It appears that a comma is missing.
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...(DURING_THAT_TIME_COMMA)
[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...(SPECIFIC_HYPHEN)
[uncategorized] ~213-~213: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...(A_LOT_OF)
[style] ~294-~294: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
[uncategorized] ~330-~330: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote `catabra-mem`. Other li...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~330-~330: Possible missing comma found.
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~334-~334: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...(EN_WEAK_ADJECTIVE)
[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...(MISSING_HYPHEN)
[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...(SPECIFIC_HYPHEN)
Additional comments not posted (10)
docs/source/installation.md (2)
3-13: Improve badge alignment and presentation.
Consider using a table or a more structured layout for the badges to enhance readability and visual appeal.
16-17: LGTM!
The description of the repository is clear and concise.
docs/source/implementation.md (4)
5-10: LGTM!
The description of the method's discrete parts is clear and concise.
36-37: LGTM!
The description of the data loading and sharding process is clear and concise.
70-72: LGTM!
The description of the output storage process is clear and concise.
100-106: LGTM!
The description of the XGBoost training process is clear and concise.
docs/source/profiling.md (3)
1-3: LGTM!
The heading and introductory paragraph are correctly formatted.
5-11: LGTM!
The description of the system comparison overview is clear and concise.
12-17: LGTM!
The description of the benchmarking tests is clear and concise.
README.md (1)
40-41: Rephrase for clarity and conciseness.
Improve the description for better readability.
- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
Likely invalid or redundant comment.
README.md
Outdated
@@ -22,13 +22,13 @@ To use MEDS-Tab, install the dependencies following commands below: | |||
|
|||
**Pip Install** | |||
|
|||
```bash | |||
```console |
Use headings instead of emphasis for installation methods.
Replace emphasis with proper headings to adhere to markdown standards.
- **Pip Install**
+ ### Pip Install
- **Local Install**
+ ### Local Install
Also applies to: 31-31
### Scripts and Examples
To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**
Use headings instead of emphasis for installation methods.
Replace emphasis with proper headings to adhere to markdown standards.
- **Pip Install**
+ ### Pip Install
- **Local Install**
+ ### Local Install
Also applies to: 53-53
Tools
Markdownlint
47-47: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
|
||
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types | ||
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories: |
Fix grammatical issues and improve clarity.
Address loose punctuation marks and improve descriptions for better readability.
- 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
+ 1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:
- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
- 3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`.
+ 3. **`meds-tab-tabularize-time-series`**: Aggregates patient data for each unique `patient_id` x `timestamp` using combinations of `window_size` and `aggregation`.
- 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
- 5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-permutations` command (See the section below for descriptions).
+ 5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using the `generate-permutations` command (see below for descriptions).
- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:
Also applies to: 78-80, 92-92, 107-107, 120-120, 135-137
Tools
LanguageTool
[uncategorized] ~69-~69: Loose punctuation mark.
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...(UNLIKELY_OPENING_PUNCTUATION)
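The `meds-tab-cache-task` description quoted in the suggestion above says labels are aligned with the nearest prior event in the tabularized data. A minimal sketch of that idea using binary search is shown below. It is not the project's code: the function name `align_labels` and the input shapes are hypothetical, and it assumes that an event at exactly the label's timestamp counts as "prior" and that labels with no earlier event are dropped.

```python
from bisect import bisect_right


def align_labels(event_times, labels):
    """Match each label to the latest event at or before the label time.

    event_times: {patient_id: sorted list of timestamps}       (assumed shape)
    labels:      iterable of (patient_id, timestamp, label)    (assumed shape)
    Timestamps may be any mutually comparable values (ints, datetimes, ...).
    """
    aligned = []
    for patient_id, label_time, label in labels:
        times = event_times.get(patient_id, [])
        # Index of the last event with time <= label_time, or -1 if none.
        idx = bisect_right(times, label_time) - 1
        if idx < 0:
            continue  # Assumption: labels without a prior event are skipped.
        aligned.append((patient_id, times[idx], label))
    return aligned


if __name__ == "__main__":
    events = {1: [1, 5, 9]}
    task_labels = [(1, 6, 1), (1, 0, 0), (1, 9, 1)]
    print(align_labels(events, task_labels))
```

Binary search keeps each lookup logarithmic in the number of events per patient, which matters when a patient has many timestamps.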
README.md
Outdated
## Implementation Improvements | ||
## The MEDS-Tab Architecture | ||
|
||
In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process. |
Fix grammatical issues and improve clarity.
Address missing commas, hyphenation, and wordiness for better readability.
- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization and XGBoost tuning processes.
- Given time series data tabularize it 3. cache task specific rows of data for efficient loading 4. XGBoost training
+ Given time series data tabularize it 3. cache task-specific rows of data for efficient loading 4. XGBoost training
- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.
+ Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly accelerates the rolling window aggregations on datasets with many concurrent observations.
- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Also applies to: 208-208, 213-213, 265-265, 294-294
Tools
LanguageTool
[typographical] ~202-~202: It appears that a comma is missing.
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...(DURING_THAT_TIME_COMMA)
README.md
Outdated
|
||
## MEDS-Tab Tabularization Technique | ||
|
||
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below. |
Fix grammatical issues and improve clarity.
Address missing commas, hyphenation, and wordiness for better readability.
- Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`.
- The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly.
+ The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients) and should be processed quickly.
- #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size
Also applies to: 334-334, 462-462
Tools
LanguageTool
[uncategorized] ~330-~330: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote `catabra-mem`. Other li...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~330-~330: Possible missing comma found.
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...(AI_HYDRA_LEO_MISSING_COMMA)
docs/source/installation.md
Outdated
|
||
To use MEDS-Tab, install the dependencies following commands below: | ||
|
||
**Pip Install** |
Convert emphasized text to headings for clarity and consistency.
Using headings instead of bold text for section titles improves the structure and readability of the document.
-**Pip Install**
+### Pip Install
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
**Pip Install** | |
### Pip Install |
Tools
Markdownlint
23-23: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
# The MEDS-Tab Architecture | ||
|
||
In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process. |
Consider adding a comma for better readability.
The introductory sentence could benefit from a comma after "In this section" to improve the flow and readability.
- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
Committable suggestion
# The MEDS-Tab Architecture | |
In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process. | |
# The MEDS-Tab Architecture | |
In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process. |
|
||
## 1. Describe Codes (compute feature frequencies) | ||
|
||
This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients: |
Consider adding a comma after 'as follows'.
This would clarify the separation between the introductory clause and the detailed description that follows.
- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-shareded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
Committable suggestion
This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients: | |
This initial stage processes a pre-shareded dataset. We expect a structure as follows, where each shard contains a subset of the patients: |
Tools
LanguageTool
[uncategorized] ~14-~14: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...(AI_HYDRA_LEO_MISSING_COMMA)
|
||
**Detailed Workflow:** | ||
|
||
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process. |
Consider using a stronger verb choice.
Replacing "speeds up" with "accelerates" might provide a stronger and more formal expression in the documentation.
- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Committable suggestion
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process. | |
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and accelerates the training process. |
Tools
LanguageTool
[style] ~93-~93: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
docs/source/implementation.md
Outdated
|
||
3. **Event Aggregation**: | ||
|
||
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations. |
Consider replacing "lots of" with "many" to enhance formality and clarity.
The phrase "lots of" might be considered informal. Using "many" can improve the formality and clarity of the documentation.
- ...ndow aggregations on datasets that have lots of concurrent observations.
+ ...ndow aggregations on datasets that have many concurrent observations.
Committable suggestion
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations. | |
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have many concurrent observations. |
Tools
LanguageTool
[style] ~64-~64: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...(A_LOT_OF)
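As a rough illustration of the same-day aggregation this comment discusses, the sketch below collapses events that share a patient and a calendar date into per-code counts. It uses only the Python standard library. The function name `count_codes_per_day` and the `(patient_id, timestamp, code)` tuple layout are assumptions made for the example, not MEDS-Tab's actual implementation or schema.

```python
from collections import Counter
from datetime import datetime


def count_codes_per_day(events):
    """Count occurrences of each code per (patient_id, calendar date).

    `events` is an iterable of (patient_id, timestamp, code) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    counts = Counter()
    for patient_id, timestamp, code in events:
        # Truncate the timestamp to its date so same-day events share a key.
        counts[(patient_id, timestamp.date(), code)] += 1
    return counts


events = [
    (1, datetime(2020, 1, 1, 8, 0), "HR"),
    (1, datetime(2020, 1, 1, 9, 30), "HR"),  # same patient, same day as above
    (1, datetime(2020, 1, 2, 8, 0), "HR"),
    (2, datetime(2020, 1, 1, 8, 0), "BP"),
]
daily = count_codes_per_day(events)
```

Here four raw events collapse into three daily rows. Any later rolling-window step then has fewer rows to scan, which is the effect the documented sentence describes for datasets with many concurrent observations.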
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (3)
docs/assets/dark_purple_meds_tab.png
is excluded by!**/*.png
docs/assets/light_purple_meds_tab.png
is excluded by!**/*.png
docs/assets/white_meds_tab.png
is excluded by!**/*.png
Files selected for processing (7)
- docs/generate.sh (1 hunks)
- docs/source/_static/switcher.json (1 hunks)
- docs/source/conf.py (2 hunks)
- docs/source/implementation.md (1 hunks)
- docs/source/index.md (1 hunks)
- docs/source/overview.md (1 hunks)
- src/MEDS_tabular_automl/init.py (1 hunks)
Files skipped from review due to trivial changes (2)
- docs/generate.sh
- src/MEDS_tabular_automl/init.py
Additional context used
Biome
docs/source/_static/switcher.json
[error] 12-12: Expected an array, an object, or a literal but instead found ']'.
Expected an array, an object, or a literal here.
(parse)
LanguageTool
docs/source/index.md
[style] ~28-~28: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...(IN_ORDER_TO_PREMIUM)
[style] ~39-~39: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...(NOT_ONLY_ALSO)
[style] ~39-~39: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 2513 characters long)
Context: ... datasets in reasonable raw formulations!(EN_EXCESSIVE_EXCLAMATION)
docs/source/implementation.md
[style] ~92-~92: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
docs/source/overview.md
[uncategorized] ~34-~34: Loose punctuation mark.
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...(THUS_SENTENCE)
[uncategorized] ~57-~57: Loose punctuation mark.
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~72-~72: Loose punctuation mark.
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~74-~74: Possible subject-verb agreement error detected.
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...(PLURAL_THAT_AGREEMENT)
[uncategorized] ~85-~85: Loose punctuation mark.
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~100-~100: Loose punctuation mark.
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...(COMMA_FOR_EXAMPLE)
[uncategorized] ~125-~125: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...(EN_WORDINESS_PREMIUM_PRIOR_TO)
Markdownlint
docs/source/overview.md
12-12: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
18-18: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
Additional comments not posted (14)
docs/source/index.md (4)
3-3: Enhance the introductory sentence for greater impact.
The current introduction could be more engaging. Consider emphasizing the unique features of MEDS-Tab early in the description.
[EXISTING_COMMENT]
- MEDS-Tab is a library designed for automated tabularization, data preparation with aggregation, and time windowing. Check out below for an overview of MEDS-Tab and how it could be useful in your workflows!
+ MEDS-Tab revolutionizes medical data handling by automating tabularization, data preparation, and time windowing. Discover how MEDS-Tab can transform your medical data workflows below!
23-24: Enhance the bullet points for clarity and specificity.
The bullet points under "Why MEDS-Tab?" could benefit from more specific examples or metrics to substantiate the claims of cost efficiency and performance.
[EXISTING_COMMENT]
- - Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- - Strong Performance: MEDS-Tab provides robustness and high performance across various datasets compared with other frameworks.
+ - Cost Efficiency: MEDS-Tab reduces operational costs by up to 50% compared to traditional data handling solutions.
+ - Strong Performance: In benchmarks, MEDS-Tab consistently outperforms other frameworks, enhancing predictive accuracy by 20%.
28-28: Streamline the wording for clarity.
The phrase "In order to" can be simplified to make the sentence more direct and concise.
[EXISTING_COMMENT]
- In order to use MEDS-Tab, you will first need to transform your raw EHR data to a MEDS format, which can be done using the following libraries:
+ To use MEDS-Tab, first transform your raw EHR data to a MEDS format using the following libraries:
Tools
LanguageTool
[style] ~28-~28: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...(IN_ORDER_TO_PREMIUM)
39-39: Improve the sentence structure for enhanced readability.
The phrase "not only... but also" can be simplified to improve the flow of the text.
[EXISTING_COMMENT]
- This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning tasks for health projects.
+ This approach simplifies the process and ensures high-quality, reproducible results for your machine learning tasks in health projects.
Tools
LanguageTool
[style] ~39-~39: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...(NOT_ONLY_ALSO)
[style] ~39-~39: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 2513 characters long)
Context: ... datasets in reasonable raw formulations!(EN_EXCESSIVE_EXCLAMATION)
docs/source/implementation.md (4)
1-3: Consider adding a comma for better readability.
The introductory sentence could benefit from a comma after "In this section" to improve the flow and readability.
[EXISTING_COMMENT]
- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
14-14: Consider adding a comma after 'as follows'.
This would clarify the separation between the introductory clause and the detailed description that follows.
[EXISTING_COMMENT]
- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-shareded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
71-71: Insert "a" before "Sparse array" to correct the determiner omission.
A determiner appears to be missing. Consider inserting it.
[EXISTING_COMMENT]
- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
92-92: Consider using a stronger verb choice.
Replacing "speeds up" with "accelerates" might provide a stronger and more formal expression in the documentation.
[EXISTING_COMMENT]
- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Tools
LanguageTool
[style] ~92-~92: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
docs/source/overview.md (5)
1-2: Clarify the repository's purpose in the introduction.
The introduction could be expanded to provide more details about the specific capabilities and advantages of the repository.
[EXISTING_COMMENT]
12-12: Convert emphasized text to headings for clarity and consistency.
Using headings instead of bold text for section titles can improve the structure and readability of the document.
[EXISTING_COMMENT]
-**Pip Install**
+### Pip Install
-**Local Install**
+### Local Install
Tools
Markdownlint
12-12: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
18-18: Convert emphasis to heading for clarity and consistency.
The emphasis on "Local Install" should be converted to a heading to maintain consistency and improve document structure.
[EXISTING_COMMENT]
- **Local Install**
+ ## Local Install
Tools
Markdownlint
18-18: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
43-43: Clarify conjunction usage in sentence.
The use of "thus" in the description of `meds-tab-tabularize-static` might confuse readers as it suggests a cause-effect relationship not clearly stated.
[EXISTING_COMMENT]
- thus rows are duplicated across multiple timestamps for the same patient.
+ Consequently, rows are duplicated across multiple timestamps for the same patient.
Tools
LanguageTool
[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...(THUS_SENTENCE)
102-102: Add a comma after introductory phrase for clarity.
The phrase "For example" should be followed by a comma to improve readability.
[EXISTING_COMMENT]
- For example you can directly call **`generate-permutations`**
+ For example, you can directly call **`generate-permutations`**
Tools
LanguageTool
[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...(COMMA_FOR_EXAMPLE)
docs/source/conf.py (1)
24-24: Ensure `version_match` is correctly set.
The environment variable `READTHEDOCS_VERSION` might not always be set, leading to potential issues. Ensure there is a fallback mechanism.
- version_match = os.environ.get("READTHEDOCS_VERSION")
+ version_match = os.environ.get("READTHEDOCS_VERSION", "dev")
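To make the suggested fallback concrete, here is a minimal sketch of how the lookup behaves with and without the variable. The helper name `resolve_version_match` and its `environ` parameter are hypothetical, introduced only so the behaviour can be exercised without touching the real environment. The `"dev"` default is taken from the suggestion above.

```python
import os


def resolve_version_match(environ=os.environ, default="dev"):
    """Return the Read the Docs version label, or `default` if it is unset.

    `environ` is any mapping; it defaults to the process environment.
    This helper is illustrative and is not part of conf.py.
    """
    return environ.get("READTHEDOCS_VERSION", default)


# Outside Read the Docs the variable is typically absent.
local_build = resolve_version_match({})
# On a hosted build the variable carries the version slug.
hosted_build = resolve_version_match({"READTHEDOCS_VERSION": "stable"})
```

One caveat: `dict.get` only falls back when the key is missing. If the variable is set to an empty string, the empty string is returned; writing `environ.get("READTHEDOCS_VERSION") or default` would cover that case too.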
"url": "https://pydata-sphinx-theme.readthedocs.io/en/stable/", | ||
"preferred": true | ||
}, | ||
] |
Fix JSON syntax error.
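The array has a trailing comma after its last object (`},` immediately before the closing `]`), which strict JSON does not allow; this matches the Biome parse error reported for line 12. Removing that comma fixes the syntax error. The bracket change suggested here only adjusts indentation.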
- ]
+]
Committable suggestion
] | |
] |
Tools
Biome
[error] 12-12: Expected an array, an object, or a literal but instead found ']'.
Expected an array, an object, or a literal here.
(parse)
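The parse error can be reproduced with Python's standard `json` module, which also rejects a trailing comma before a closing bracket. The snippet below is a sketch: the `BROKEN` string mirrors only the fragment of `switcher.json` visible in this thread, and the helper name `is_valid_json` is made up for the example.

```python
import json

# Fragment shaped like the reviewed file: note the comma after the last object.
BROKEN = """[
  {
    "url": "https://pydata-sphinx-theme.readthedocs.io/en/stable/",
    "preferred": true
  },
]"""

# Same content with the trailing comma removed.
FIXED = BROKEN.replace("},\n]", "}\n]")


def is_valid_json(text):
    """Return True if `text` parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


broken_ok = is_valid_json(BROKEN)
fixed_ok = is_valid_json(FIXED)
```

A check like this could run in CI before the docs build, so a malformed switcher file is caught before the theme tries to load it.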
@@ -1,5 +1,9 @@ | |||
import os | |||
import shutil |
Remove unused import.
The `shutil` module is imported but not used in the script.
-import shutil
Committable suggestion
import shutil |
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- docs/source/_static/switcher.json (1 hunks)
- docs/source/conf.py (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- docs/source/_static/switcher.json
Additional comments not posted (32)
docs/source/conf.py (32)
4-4: Approved: Importing `Path` from `pathlib`.
The `Path` module is used for path operations in the script.
17-18: Approved: Updated project metadata.
The project metadata, including `project`, `copyright`, and `author`, has been updated appropriately.
22-22: Approved: Added `json_url` variable.
The `json_url` variable is defined for the version switcher.
24-24: Approved: Added `version_match` variable.
The `version_match` variable is defined for version handling.
25-25: Approved: Set `release` variable.
The `release` variable is set to the version of `MEDS_tabular_automl`.
29-41: Approved: Logic for handling `version_match`.
The script includes logic to handle different values of `version_match`.
49-49: Approved: Set language to "en".
The language for the documentation is set to English.
52-53: Approved: Defined `__location__` and `__src__`.
The `__location__` and `__src__` variables are defined using `Path`.
58-58: Approved: Added source directory to `sys.path`.
The script adds the source directory to `sys.path`.
61-76: Approved: Defined `ensure_pandoc_installed` function.
The function `ensure_pandoc_installed` ensures Pandoc is installed.
79-96: Approved: Script for running `sphinx-apidoc`.
The script runs `sphinx-apidoc` automatically.
108-119: Approved: Added Sphinx extensions.
The list of Sphinx extensions includes several new extensions.
123-123: Approved: Set HTML theme to `pydata_sphinx_theme`.
The HTML theme is set to `pydata_sphinx_theme`.
125-129: Approved: Updated sidebar configuration.
The sidebar configuration for the HTML theme is updated.
131-131: Approved: Set `nbsphinx_allow_errors` to `True`.
The `nbsphinx_allow_errors` variable is set to `True`.
134-136: Approved: Defined and created `collections_dir`.
The `collections_dir` is defined and created if it does not exist.
138-138: Approved: Defined `python_version` variable.
The `python_version` variable is defined for intersphinx mapping.
140-153: Approved: Updated intersphinx mapping.
The intersphinx mapping includes several new entries.
162-162: Approved: Updated `exclude_patterns` variable.
The `exclude_patterns` variable is updated.
165-165: Approved: Set `nb_execution_mode` to "off".
The `nb_execution_mode` variable is set to "off".
171-182: Approved: Updated `myst_enable_extensions` variable.
The `myst_enable_extensions` variable includes several new extensions.
184-184: Approved: Set `myst_update_mathjax` to `True`.
The `myst_update_mathjax` variable is set to `True`.
187-199: Approved: Updated `myst_url_schemes` variable.
The `myst_url_schemes` variable includes several new URL schemes.
202-202: Approved: Updated `source_suffix` variable.
The `source_suffix` variable is updated.
208-208: Approved: Set `master_doc` to "index".
The `master_doc` variable is set to "index".
225-225: Approved: Set `pygments_style` to "tango".
The `pygments_style` variable is set to "tango".
235-235: Approved: Set `todo_emit_warnings` to `True`.
The `todo_emit_warnings` variable is set to `True`.
243-244: Approved: Set `html_title` and `html_short_title`.
The `html_title` and `html_short_title` variables are set.
249-287: Approved: Updated `html_theme_options` variable.
The `html_theme_options` variable includes several new options.
289-293: Approved: Defined `html_context` variable.
The `html_context` variable is defined.
328-329: Approved: Set `html_show_copyright` and `htmlhelp_basename`.
The `html_show_copyright` and `htmlhelp_basename` variables are set.
2-2: Remove unused import.
The `shutil` module is imported but not used in the script.
-import shutil
Likely invalid or redundant comment.
Actionable comments posted: 4
Outside diff range and nitpick comments (6)
docs/source/profiling.md (5)
7-7: Add a comma for better readability.
The sentence could benefit from a comma after "comparison" to improve the flow and readability.
- Specifically, for comparison we used:
+ Specifically, for comparison, we used:
24-24: Remove "of" for conciseness.
The phrase "in all of the scenarios" can be simplified to "in all scenarios" to enhance readability and conciseness.
- ...emphasizing the better performance of MEDS-Tab in all of the scenarios.
+ ...emphasizing the better performance of MEDS-Tab in all scenarios.
Tools
LanguageTool
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
24-24: Correct hyphenation in adjectival compound.
When a number forms part of an adjectival compound, it should be hyphenated to improve readability.
- Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
Tools
LanguageTool
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
20-20: Consider adding a comma for better readability.
The sentence could benefit from a comma after "datasets" to improve the flow and readability.
- ...w that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab.
+ ...w that on the MIMIC-IV and eICU medical datasets, we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab.
84-84: Ensure the file ends with a single newline character.
Files should end with a single newline character to adhere to best practices.
+ \n
docs/source/prediction.md (1)
85-85: Specify language for fenced code blocks to adhere to Markdown best practices.
Fenced code blocks should have a language specified.
- ``` + ```bash
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- docs/source/prediction.md (1 hunks)
- docs/source/profiling.md (1 hunks)
Additional context used
LanguageTool
docs/source/profiling.md
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
docs/source/prediction.md
[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...(SPECIFIC_HYPHEN)
[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...(SPECIFIC_HYPHEN)
Markdownlint
docs/source/prediction.md
121-121: null
Spaces inside code span elements(MD038, no-space-in-code)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000)) | ||
``` | ||
|
||
Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over. |
Remove spaces inside code span elements.
Spaces inside code span elements should be removed to adhere to best practices.
- ` tabularization.aggs`
+ `tabularization.aggs`
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over. | |
Note that the XGBoost command shown includes `tabularization.window_sizes` and `tabularization.aggs` in the parameters to sweep over. |
Tools
Markdownlint
121-121: null
Spaces inside code span elements(MD038, no-space-in-code)
| LOS in Hospital > 3 days | Admission + 24 hr | 6m5s | 7m5s | 1m4s | 11,012 | 12,223 | | ||
| LOS in Hospital > 3 days | Admission + 48 hr | 6m10s | 7m12s | 1m4s | 10,703 | 11,830 | | ||
|
||
#### 1.2 MIMIC-IV Task Specific Training Cohort Size |
Hyphenate "Task-Specific" for grammatical correctness.
When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
- #### 1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.2 MIMIC-IV Task-Specific Training Cohort Size
Committable suggestion
#### 1.2 MIMIC-IV Task Specific Training Cohort Size | |
#### 1.2 MIMIC-IV Task-Specific Training Cohort Size |
Tools
LanguageTool
[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...(SPECIFIC_HYPHEN)
| Post-discharge 30 day Mortality | Discharge | 0.003 | 0.0116 | 0.001 | 0.730 | 13 | 986 | 7 | 7 | | ||
| Post-discharge 1 year Mortality | Discharge | 0.005 | 0.006 | 0.002 | 0.690 | 93 | 938 | 6 | 14 | | ||
|
||
#### 3. eICU Task Specific Training Cohort Size |
Hyphenate "Task-Specific" for grammatical correctness.
When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
- #### 3. eICU Task Specific Training Cohort Size
+ #### 3. eICU Task-Specific Training Cohort Size
Committable suggestion
#### 3. eICU Task Specific Training Cohort Size | |
#### 3. eICU Task-Specific Training Cohort Size |
Tools
LanguageTool
[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...(SPECIFIC_HYPHEN)
Actionable comments posted: 19
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- README.md (2 hunks)
- docs/source/profiling.md (1 hunks)
Files not summarized due to errors (1)
- README.md: Error: Message exceeds token limit
Additional context used
LanguageTool
docs/source/profiling.md
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
README.md
[style] ~61-~61: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...(IN_ORDER_TO_PREMIUM)
[style] ~72-~72: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...(NOT_ONLY_ALSO)
[uncategorized] ~76-~76: Loose punctuation mark.
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~85-~85: Loose punctuation mark.
Context: ...nt. 2. `meds-tab-tabularize-static`: Filters and processes the dataset based...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~85-~85: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple tim...(THUS_SENTENCE)
[uncategorized] ~99-~99: Loose punctuation mark.
Context: ...3. `meds-tab-tabularize-time-series`: Iterates through combinations of a shar...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~114-~114: Loose punctuation mark.
Context: ...ax] ``` 4. `meds-tab-cache-task`: Aligns task-specific labels with the ne...(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~116-~116: Possible subject-verb agreement error detected.
Context: ...a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/j...(PLURAL_THAT_AGREEMENT)
[uncategorized] ~127-~127: Loose punctuation mark.
Context: ...e/max] ``` 5. `meds-tab-xgboost`: Trains an XGBoost model using user-spec...(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~142-~142: Loose punctuation mark.
Context: ... Scripts 1. `generate-permutations`: Generates and prints a sorted list of a...(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~144-~144: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...(COMMA_FOR_EXAMPLE)
[uncategorized] ~167-~167: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~177-~177: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...(EN_WORDINESS_PREMIUM_PRIOR_TO)
[uncategorized] ~220-~220: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~298-~298: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
[uncategorized] ~334-~334: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote `catabra-mem`. Other li...(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~338-~338: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~338-~338: Possible missing comma found.
Context: ... that `catabra-mem` is omitted from the tables as it never completed within the 10 min...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~338-~338: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...(SPECIFIC_HYPHEN)
[uncategorized] ~594-~594: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...(SPECIFIC_HYPHEN)
Markdownlint
README.md
21-21: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
32-32: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
38-38: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
207-207: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
315-315: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
400-400: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
520-520: null
Spaces inside code span elements(MD038, no-space-in-code)
Additional comments not posted (4)
docs/source/profiling.md (2)
20-20: Correct the spelling error.
The word "depecited" should be corrected to "depicted."
- Tabularization of time-series data, as depecited above, is commonly used in several past works.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works.
Likely invalid or redundant comment.
24-24: Remove "of" after "all" for conciseness.
The phrase "in all of the scenarios" can be simplified to "in all scenarios" to enhance readability and conciseness.
- ...ing the better performance of MEDS-Tab in all of the scenarios.
+ ...ing the better performance of MEDS-Tab in all scenarios.
Likely invalid or redundant comment.
Tools
LanguageTool
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
README.md (2)
52-58: LGTM!
The "Why MEDS-Tab?" section is clear and informative.
400-400: LGTM!
The "Prediction Performance" section is clear and informative.
Tools
Markdownlint
400-400: null
Multiple top-level headings in the same document(MD025, single-title, single-h1)
|
||
### MEDS-Tab Tabularization Technique | ||
|
||
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below. |
Add a comma for better readability.
Consider adding a comma after "method" for better readability.
- ...more memory efficient version of their method which we denote `catabra-mem`.
+ ...more memory efficient version of their method, which we denote `catabra-mem`.
Committable suggestion
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below. | |
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below. |
|
||
## 2. Comparative Performance Analysis | ||
|
||
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. |
Correct hyphenation in adjectival compound.
When a number forms part of an adjectival compound, it should be hyphenated to improve readability.
- Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
Committable suggestion
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. | |
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget. |
Tools
LanguageTool
[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backwards in time from a given index date. | ||
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems. | ||
|
||
## Quick Start |
Use headings instead of emphasis for installation methods.
Replace emphasis with proper headings to adhere to markdown standards.
- **Pip Install**
+ ### Pip Install
Committable suggestion was skipped due to low confidence.
|
||
To use MEDS-Tab, install the dependencies following commands below: | ||
|
||
**Pip Install** | ||
|
||
```bash | ||
```console | ||
pip install meds-tab | ||
``` | ||
|
||
**Local Install** |
Use headings instead of emphasis for installation methods.
Replace emphasis with proper headings to adhere to markdown standards.
- **Local Install**
+ ### Local Install
Committable suggestion
**Local Install** | |
### Local Install |
Tools
Markdownlint
38-38: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
6. **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep to optimize over `window_sizes`, `aggregations`, and `min_code_inclusion_frequency`, aiming to enhance model performance and adaptability. | ||
|
||
### Additional CLI Scripts | ||
## Additional CLI Scripts | ||
|
||
1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations. | ||
|
||
For example you can directly call **`generate-permutations`** in the command line: |
Add a comma after 'For example'.
Improve readability by adding a comma.
- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:
Committable suggestion
For example you can directly call **`generate-permutations`** in the command line: | |
For example, you can directly call **`generate-permutations`** in the command line: |
Tools
LanguageTool
[typographical] ~144-~144: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...(COMMA_FOR_EXAMPLE)
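The README text quoted in this comment says `generate-permutations` prints a sorted list built from a comma-separated input so that sweeps can cover combinations of window sizes and aggregations. The thread does not show the script's code or its exact output format, so the following is only a minimal sketch of one plausible reading, in which "permutations" means every non-empty combination of the input items. The names `generate_subsets` and `format_for_sweep` are hypothetical and are not part of MEDS-Tab.

```python
from itertools import combinations


def generate_subsets(csv_text: str) -> list[tuple[str, ...]]:
    """Return every non-empty combination of the comma-separated items, sorted.

    Assumption: the input looks like "[2,3,4]" or "2,3,4"; items are kept as strings.
    """
    items = sorted(part.strip() for part in csv_text.strip("[]").split(",") if part.strip())
    subsets: list[tuple[str, ...]] = []
    for size in range(1, len(items) + 1):
        subsets.extend(combinations(items, size))
    return sorted(subsets)


def format_for_sweep(subsets: list[tuple[str, ...]]) -> str:
    """Join the combinations into one comma-separated string of bracketed lists."""
    return ",".join("[" + ",".join(subset) + "]" for subset in subsets)


if __name__ == "__main__":
    print(format_for_sweep(generate_subsets("[2,3,4]")))
```

Because the items are compared as strings, the ordering is lexicographic; a real sweep over numeric window sizes might want a different sort key.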
|
||
**Detailed Workflow:** | ||
|
||
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process. |
Use a stronger verb for clarity.
Replace "speeds up" with "accelerates" for stronger wording.
- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Committable suggestion
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process. | |
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and accelerates the training process. |
Tools
LanguageTool
[style] ~298-~298: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...(SPEED_UP_ACCELERATE)
|
||
## 2. Comparative Performance Analysis | ||
|
||
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. |
Fix grammatical issues and improve clarity.
Address missing commas and improve descriptions for better readability.
- The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients), data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
Committable suggestion
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget. | |
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients), data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget. |
Tools
LanguageTool
[style] ~338-~338: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...(ALL_OF_THE)
[uncategorized] ~338-~338: Possible missing comma found.
Context: ... that `catabra-mem` is omitted from the tables as it never completed within the 10 min...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~338-~338: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...(MISSING_HYPHEN)
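The paragraph under review describes a fixed 10-minute budget per run, with methods that do not finish being omitted from the tables. The thread does not show how the benchmark enforced that budget, so the sketch below is an assumption: it wraps a command in `subprocess.run` with a timeout and reports either the elapsed wall-clock time or `None` when the budget is exceeded. The helper name `run_with_budget` is hypothetical, and memory measurement is deliberately left out.

```python
import subprocess
import sys
import time
from typing import Optional, Sequence


def run_with_budget(cmd: Sequence[str], budget_seconds: float) -> Optional[float]:
    """Run cmd and return elapsed seconds, or None if it exceeds the budget.

    subprocess.run kills the child process when the timeout expires.
    """
    start = time.perf_counter()
    try:
        subprocess.run(cmd, check=True, timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command; a real benchmark would invoke the tabularization tool here.
    elapsed = run_with_budget([sys.executable, "-c", "print('done')"], budget_seconds=600)
    print("did not finish within budget" if elapsed is None else f"finished in {elapsed:.2f}s")
```

Returning `None` rather than raising keeps the "never completed" case explicit, which matches how the text treats `catabra-mem`.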
downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- | ||
what is more advanced is the efficient construction, storage, and loading of tabular features for the | ||
candidate AutoML models, enabling a far more extensive search over different featurization strategies. | ||
See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm. |
Clarify the example description.
Improve the description for better readability.
- See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
+ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline run on synthetic data. This script is a functional test executed with `pytest` to verify the algorithm's correctness.
Committable suggestion
See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm. | |
See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline run on synthetic data. This script is a functional test executed with `pytest` to verify the algorithm's correctness. |
## Core CLI Scripts Overview | ||
|
||
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories: | ||
|
||
- time-series codes (codes with timestamps) | ||
- time-series numerical values (codes with timestamps and numerical values) | ||
- static codes (codes without timestamps) | ||
- static numerical codes (codes without timestamps but with numerical values). | ||
|
||
**Caching feature names and frequencies** in a dataset stored in `"path_to_data"` | ||
|
||
``` | ||
meds-tab-describe MEDS_cohort_dir="path_to_data" | ||
``` | ||
This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument. |
Fix grammatical issues and improve clarity.
Address loose punctuation marks and improve descriptions for better readability.
- 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
+ 1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:
Committable suggestion
## Core CLI Scripts Overview | |
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories: | |
- time-series codes (codes with timestamps) | |
- time-series numerical values (codes with timestamps and numerical values) | |
- static codes (codes without timestamps) | |
- static numerical codes (codes without timestamps but with numerical values). | |
**Caching feature names and frequencies** in a dataset stored in `"path_to_data"` | |
``` | |
meds-tab-describe MEDS_cohort_dir="path_to_data" | |
``` | |
This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument. | |
## Core CLI Scripts Overview | |
1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as: | |
- time-series codes (codes with timestamps) | |
- time-series numerical values (codes with timestamps and numerical values) | |
- static codes (codes without timestamps) | |
- static numerical codes (codes without timestamps but with numerical values). | |
This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument. |
Tools
LanguageTool
[uncategorized] ~76-~76: Loose punctuation mark.
Context: ...pts Overview 1. `meds-tab-describe`: This command processes MEDS data shards... (UNLIKELY_OPENING_PUNCTUATION)

### MEDS-Tab Tabularization Technique

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
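As a generic illustration of what "rolling window" tabularization means in the paragraph above, the sketch below counts code occurrences inside look-back windows ending at an index time and flattens them into one feature row. It is not MEDS-Tab's algorithm or API: `count_codes_in_window`, `tabularize`, and the `code/window/count` column naming are hypothetical, and the half-open window `(index_time - window, index_time]` is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

TimedCode = Tuple[datetime, str]


def count_codes_in_window(
    events: Iterable[TimedCode], index_time: datetime, window: timedelta
) -> Counter:
    """Count codes whose timestamp falls in (index_time - window, index_time]."""
    start = index_time - window
    return Counter(code for ts, code in events if start < ts <= index_time)


def tabularize(
    events: Iterable[TimedCode], index_time: datetime, windows: Dict[str, timedelta]
) -> Dict[str, int]:
    """Flatten per-window code counts into a single row keyed by a column name."""
    events = list(events)  # allow several passes over the same events
    row: Dict[str, int] = {}
    for name, window in windows.items():
        for code, n in count_codes_in_window(events, index_time, window).items():
            row[f"{code}/{name}/count"] = n  # hypothetical column naming
    return row


events = [
    (datetime(2024, 1, 1), "HR"),
    (datetime(2024, 1, 6), "HR"),
    (datetime(2024, 1, 7), "LAB"),
]
row = tabularize(
    events,
    index_time=datetime(2024, 1, 7),
    windows={"1d": timedelta(days=1), "7d": timedelta(days=7)},
)
```

A full pipeline would also aggregate numerical values and handle static codes; the sketch covers counts only.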
Fix grammatical issues and improve clarity.
Address missing commas and improve descriptions for better readability.
- Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`.
Committable suggestion:
Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
Tools
LanguageTool
[uncategorized] ~334-~334: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote `catabra-mem`. Other li... (AI_HYDRA_LEO_MISSING_COMMA)
Summary by CodeRabbit
New Features
Documentation
`index.md`, `installation.md`, `implementation.md`, `overview.md`, `prediction.md`, and `profiling.md` to provide comprehensive information on MEDS-Tab functionalities.
Chores
`.gitignore` to exclude specific documentation directories.
`docs/requirements.txt` and `docs/source/conf.py`.
Refactor