Docs #36

mmcdermott · 2024-06-13T10:24:57Z

Summary by CodeRabbit

New Features
- Added detailed documentation including installation, implementation, and performance profiling of the MEDS-Tab system.
Documentation
- Enhanced README with updated installation instructions and roadmap.
- Added index.md, installation.md, implementation.md, overview.md, prediction.md, and profiling.md to provide comprehensive information on MEDS-Tab functionalities.
Chores
- Updated .gitignore to exclude specific documentation directories.
- Modified documentation dependencies and configurations in docs/requirements.txt and docs/source/conf.py.
Refactor
- Improved import statements across various scripts for better modularity and maintainability.

…ormat is still very off though.

coderabbitai · 2024-06-13T10:25:08Z

Walkthrough

The latest changes primarily enhance the MEDS-Tab system's documentation, install processes, code structure, and functionalities. New documentation files explain various features, including installation, implementation, prediction, and profiling. The .gitignore and README.md files have been updated for better clarity and project management. Import paths in several Python scripts are refactored to use relative imports, improving the internal code structure.

Changes

File/Path	Summary
.gitignore	Added exclusions for specific documentation directories.
README.md	Updated installation instructions, script descriptions, and roadmap for future improvements.
docs/make.bat	Added a conditional check for an argument and adjusted script flow.
docs/requirements.txt	Updated dependencies and added new packages like `esgpt` and `pydata-sphinx-theme`.
docs/source/conf.py	Extensive changes to project metadata, theme configurations, extensions, and version handling.
docs/source/index.md, docs/source/installation.md, docs/source/implementation.md, docs/source/overview.md, docs/source/prediction.md, docs/source/profiling.md	Added new documentation sections explaining features, usage, and comparisons of MEDS-Tab.
docs/generate.sh	Added a new script to clean build directory, generate HTML files, and serve them.
src/MEDS_...scripts/...	Refactored import statements to use relative imports for better organization.
src/MEDS_tabular_automl/utils.py	Modified `OmegaConf.register_new_resolver` to include `replace=True`.
.readthedocs.yaml	Added PDF build format option for Sphinx configuration.
pyproject.toml	Adjusted `scipy` dependency version to `<1.14.0`.

Sequence Diagram(s)

No sequence diagrams were generated as the changes are primarily related to documentation, import paths, and configuration updates rather than new features or significant alterations to control flow.

Poem

In fields of code, where rabbits hop,
They polished scripts, each line, each top.
With docs now clear as morning dew,
Installation paths, they shine anew.
Imports settled, no more fight,
Our MEDS now gleam in data's light.
📚🐇✨


codecov-commenter · 2024-06-13T10:28:07Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.73%. Comparing base (cfba4ae) to head (7fe635a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
+ Coverage   91.72%   91.73%   +0.01%     
==========================================
  Files          13       14       +1     
  Lines         822      823       +1     
==========================================
+ Hits          754      755       +1     
  Misses         68       68
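The percentages in the table follow from hits divided by lines. The sketch below recomputes them, assuming the displayed values are truncated (not rounded) to two decimals; that truncation is an inference from the numbers shown here, not documented Codecov behaviour, and the helper names are hypothetical.

```python
def coverage_basis_points(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated toward zero."""
    return hits * 10_000 // lines


def as_percent(basis_points: int) -> str:
    """Render hundredths of a percent as a two-decimal percentage string."""
    return f"{basis_points // 100}.{basis_points % 100:02d}%"


# Hits and line counts copied from the coverage diff table.
base = coverage_basis_points(754, 822)
head = coverage_basis_points(755, 823)

print(as_percent(base), as_percent(head), as_percent(head - base))
```

Using integer arithmetic avoids floating-point rounding deciding the last displayed digit.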


coderabbitai

Actionable comments posted: 8

Outside diff range and nitpick comments (3)

src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1)
Line range hint 94-102: Address the loop variable binding issue to prevent potential bugs in asynchronous or concurrent execution.
- for shard_fp, window_size, agg in iter_wrapper(tabularization_tasks):
+ for shard_fp, window_size, agg in iter_wrapper(list(tabularization_tasks)):
Tools

Ruff

15-15: Module level import not at top of file (E402)

16-16: Module level import not at top of file (E402)

18-18: Module level import not at top of file (E402)

19-19: Module level import not at top of file (E402)

20-20: Module level import not at top of file (E402)

21-21: Module level import not at top of file (E402)

22-22: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/tabularize_static.py (1)
Line range hint 142-142: Address the loop variable binding issue to prevent potential bugs in asynchronous or concurrent execution.
- for shard_fp, agg in iter_wrapper(tabularization_tasks):
+ for shard_fp, agg in iter_wrapper(list(tabularization_tasks)):
Tools

Ruff

15-15: Module level import not at top of file (E402)

17-22: Module level import not at top of file (E402)

23-23: Module level import not at top of file (E402)

24-24: Module level import not at top of file (E402)

25-25: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/launch_xgboost.py (1)
Line range hint 90-90: Remove unnecessary True if ... else False in condition.
- code_mask = [True if idx in codes_set else False for idx in feature_ids]
+ code_mask = [idx in codes_set for idx in feature_ids]
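As a quick check that the simplification above is behaviour-preserving, the sketch below builds both masks and compares them. The values of `codes_set` and `feature_ids` are hypothetical stand-ins, not data from `launch_xgboost.py`.

```python
# Hypothetical stand-ins for the values used in the script.
codes_set = {2, 5, 7}
feature_ids = [1, 2, 3, 5, 8]

# Original form: the conditional expression only restates the boolean.
verbose_mask = [True if idx in codes_set else False for idx in feature_ids]

# Simplified form: `idx in codes_set` already evaluates to a bool.
concise_mask = [idx in codes_set for idx in feature_ids]

print(concise_mask)  # [False, True, False, True, False]
print(verbose_mask == concise_mask)  # True
```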

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cfba4ae and 2f72bb8.

Files ignored due to path filters (3)

docs/assets/main_fig.png is excluded by !**/*.png
docs/assets/meds_logo.png is excluded by !**/*.png
docs/assets/pivot.png is excluded by !**/*.png

Files selected for processing (17)

.gitignore (2 hunks)
README.md (3 hunks)
docs/make.bat (2 hunks)
docs/requirements.txt (1 hunks)
docs/source/conf.py (3 hunks)
docs/source/implementation.md (1 hunks)
docs/source/index.md (1 hunks)
docs/source/installation.md (1 hunks)
docs/source/overview.md (1 hunks)
docs/source/prediction.md (1 hunks)
docs/source/profiling.md (1 hunks)
src/MEDS_tabular_automl/scripts/cache_task.py (1 hunks)
src/MEDS_tabular_automl/scripts/describe_codes.py (1 hunks)
src/MEDS_tabular_automl/scripts/launch_xgboost.py (1 hunks)
src/MEDS_tabular_automl/scripts/tabularize_static.py (1 hunks)
src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1 hunks)
src/MEDS_tabular_automl/utils.py (1 hunks)

Files skipped from review due to trivial changes (5)

.gitignore
docs/source/prediction.md
docs/source/profiling.md
src/MEDS_tabular_automl/scripts/cache_task.py
src/MEDS_tabular_automl/scripts/describe_codes.py

Additional context used

Markdownlint

docs/source/implementation.md

3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time

docs/source/installation.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

docs/source/overview.md

14-14: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time

33-33: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

41-41: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

53-53: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

68-68: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

79-79: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

README.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

167-167: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

169-169: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

68-68: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

76-76: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

88-88: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

103-103: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

114-114: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

LanguageTool

docs/source/index.md

[style] ~48-~48: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...ithin the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

[style] ~48-~48: Using many exclamation marks might seem excessive (in this case: 8 exclamation marks for a text that’s 2711 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

docs/source/overview.md

[uncategorized] ~24-~24: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~37-~37: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...``` 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~37-~37: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~49-~49: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~64-~64: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~66-~66: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~77-~77: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~77-~77: You might be missing the article “the” here. (AI_EN_LECTOR_MISSING_DETERMINER_THE)
Context: ...izesandaggscan be generated usinggenerate-permutations` command (See the ...

[uncategorized] ~90-~90: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... ``` 6. meds-tab-xgboost-sweep: Conducts an Optuna hyperparameter sweep...

[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~96-~96: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[uncategorized] ~96-~96: The preposition “on” seems more likely in this position than the preposition “in”. (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
Context: ...rectly call generate-permutations in the command line: ```bash genera...

README.md

[uncategorized] ~59-~59: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...``` 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~72-~72: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~84-~84: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~99-~99: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~101-~101: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~112-~112: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~125-~125: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... ``` 6. meds-tab-xgboost-sweep: Conducts an Optuna hyperparameter sweep...

[uncategorized] ~129-~129: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~131-~131: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

Ruff

src/MEDS_tabular_automl/scripts/tabularize_time_series.py

8-8: Module level import not at top of file (E402)

9-9: Module level import not at top of file (E402)

10-10: Module level import not at top of file (E402)

11-11: Module level import not at top of file (E402)

13-13: Module level import not at top of file (E402)

14-14: Module level import not at top of file (E402)

15-15: Module level import not at top of file (E402)

16-16: Module level import not at top of file (E402)

18-18: Module level import not at top of file (E402)

19-19: Module level import not at top of file (E402)

20-20: Module level import not at top of file (E402)

21-21: Module level import not at top of file (E402)

22-22: Module level import not at top of file (E402)

23-30: Module level import not at top of file (E402)

94-94: Function definition does not bind loop variable agg (B023)

101-101: Function definition does not bind loop variable window_size (B023)

102-102: Function definition does not bind loop variable agg (B023)
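The B023 findings above concern functions defined inside a loop that read the loop variables only when called. The following minimal sketch, with a hypothetical task list and function names, shows the late-binding behaviour and one common remedy (binding the current values as default arguments). It is an illustration of the lint rule, not the code in `tabularize_time_series.py`.

```python
# Hypothetical (shard, window_size, agg) tasks for illustration only.
tasks = [("shard_0", "1d", "sum"), ("shard_0", "7d", "max")]

# Late binding: each function looks up window_size/agg when it is called,
# so calls made after the loop finishes all see the final iteration's values.
late = []
for shard_fp, window_size, agg in tasks:
    def compute_late():
        return (window_size, agg)
    late.append(compute_late)

# Early binding: default arguments capture the values of this iteration.
bound = []
for shard_fp, window_size, agg in tasks:
    def compute_bound(window_size=window_size, agg=agg):
        return (window_size, agg)
    bound.append(compute_bound)

print([fn() for fn in late])   # [('7d', 'max'), ('7d', 'max')]
print([fn() for fn in bound])  # [('1d', 'sum'), ('7d', 'max')]
```

If each function is called within the same iteration that defines it, the late-binding form still returns the expected values, which is why the issue only surfaces under deferred or concurrent execution. Materializing the iterable with `list(...)` does not by itself change how the loop variables are captured.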

src/MEDS_tabular_automl/scripts/tabularize_static.py

13-13: Module level import not at top of file (E402)

15-15: Module level import not at top of file (E402)

17-22: Module level import not at top of file (E402)

23-23: Module level import not at top of file (E402)

24-24: Module level import not at top of file (E402)

25-25: Module level import not at top of file (E402)

26-34: Module level import not at top of file (E402)

142-142: Function definition does not bind loop variable agg (B023)

docs/source/conf.py

61-61: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/launch_xgboost.py

90-90: Remove unnecessary True if ... else False (SIM210)

Remove unnecessary True if ... else False

Additional comments not posted (15)

docs/requirements.txt (1)

4-4: Ensure the new dependencies sphinx_immaterial and esgpt are compatible with the project's existing environment.

Also applies to: 14-14

docs/make.bat (1)

13-14: Good addition of a conditional check to guide users when no arguments are provided.

src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1)

18-23: Ensure the relative imports are correctly resolving the modules.

Tools

Ruff

18-18: Module level import not at top of file (E402)

19-19: Module level import not at top of file (E402)

20-20: Module level import not at top of file (E402)

21-21: Module level import not at top of file (E402)

22-22: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/tabularize_static.py (1)
17-26: Ensure the relative imports are correctly resolving the modules.

Verification successful

The relative imports in the src/MEDS_tabular_automl/scripts/tabularize_static.py file are consistent and correctly resolve the modules as they are used across multiple files in the project without any issues.

Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the resolution of relative imports after refactoring.

# Test: Search for the usage of these modules in the project to ensure no unresolved imports.
fd --exec rg 'from ..describe_codes import'
fd --exec rg 'from ..file_name import'
fd --exec rg 'from ..generate_static_features import'
fd --exec rg 'from ..mapper import'
fd --exec rg 'from ..utils import'
Length of output: 6598

Tools

Ruff

17-22: Module level import not at top of file (E402)

23-23: Module level import not at top of file (E402)

24-24: Module level import not at top of file (E402)

25-25: Module level import not at top of file (E402)

README.md (3)

49-57: The changes in the "Scripts and Examples" section enhance the clarity and usefulness of the documentation. Good job!

127-127: The "Additional CLI Scripts" section is well-written and provides useful information on the new CLI script.

152-152: The explanation of "tabular pipelines" is insightful and helps clarify common misconceptions about structured EHR data.

docs/source/conf.py (5)

15-18: The update to the project information, including the new version number and additional authors, is accurate and up-to-date.

85-96: The addition of new Sphinx extensions will enhance the documentation capabilities, providing more features such as todo items, code views, and coverage reports.

100-100: The change to the "pydata_sphinx_theme" is a good choice for enhancing the visual appeal and functionality of the documentation.

175-175: The simplification of the source suffix to only include ".rst" and ".md" makes the documentation sources more consistent.

135-135: Adding common build and system files to the exclude patterns is a good practice to keep the documentation build clean.

src/MEDS_tabular_automl/utils.py (1)

81-81: The addition of replace=True in the OmegaConf.register_new_resolver call is a good practice to ensure the resolver is correctly registered without conflicts.

src/MEDS_tabular_automl/scripts/launch_xgboost.py (2)
16-18: Relative imports improve modularity and maintainability of the codebase.

16-18: Ensure that the relative paths are correctly resolving to the intended modules.

Verification successful

The relative import paths in launch_xgboost.py are correctly resolving to the intended modules.

get_feature_columns is imported from describe_codes.

get_model_files and list_subdir_files are defined in file_name.py.

get_feature_indices and hydra_loguru_init are defined in utils.py.

Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
ast-grep --lang python --pattern $'from ..describe_codes import get_feature_columns'
Length of output: 189

Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
ast-grep --lang python --pattern $'def get_model_files($_) { $$$ }' | grep 'file_name.py'
ast-grep --lang python --pattern $'def list_subdir_files($_) { $$$ }' | grep 'file_name.py'

# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
ast-grep --lang python --pattern $'def get_feature_indices($_) { $$$ }' | grep 'utils.py'
ast-grep --lang python --pattern $'def hydra_loguru_init($_) { $$$ }' | grep 'utils.py'
Length of output: 2179

Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
ast-grep --lang python --pattern 'def get_model_files' | grep 'file_name.py'
ast-grep --lang python --pattern 'def list_subdir_files' | grep 'file_name.py'

# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
ast-grep --lang python --pattern 'def get_feature_indices' | grep 'utils.py'
ast-grep --lang python --pattern 'def hydra_loguru_init' | grep 'utils.py'
Length of output: 1687

Script:
#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
rg 'def get_model_files' | grep 'file_name.py'
rg 'def list_subdir_files' | grep 'file_name.py'

# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
rg 'def get_feature_indices' | grep 'utils.py'
rg 'def hydra_loguru_init' | grep 'utils.py'
Length of output: 577
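The text searches above confirm that the definitions exist. As a complementary check, the standard library can show which absolute module a relative import would point to. This sketch assumes the scripts are importable as the package `MEDS_tabular_automl.scripts` (inferred from the file paths in this review, not confirmed by it). It only resolves names and does not import anything.

```python
from importlib.util import resolve_name

# Assumption: the package that contains launch_xgboost.py and its siblings.
SCRIPTS_PACKAGE = "MEDS_tabular_automl.scripts"

# Relative module names taken from the import statements searched for above.
relative_modules = ("..describe_codes", "..file_name", "..utils")

# Two leading dots mean "one level above the current package".
resolved = {name: resolve_name(name, SCRIPTS_PACKAGE) for name in relative_modules}

for relative, absolute in resolved.items():
    print(f"{relative} -> {absolute}")
```

Each name resolves to a module directly under `MEDS_tabular_automl`, which is where the review reports `describe_codes`, `file_name.py`, and `utils.py` to live.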

coderabbitai · 2024-06-13T10:31:07Z

docs/source/index.md

+
+- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!
+
+By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!


Consider rephrasing for conciseness and clarity.

- This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
+ This approach simplifies the process and ensures high-quality, reproducible results for ...

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach simplifies the process and ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!

Tools

LanguageTool

[style] ~48-~48: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...ithin the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

[style] ~48-~48: Using many exclamation marks might seem excessive (in this case: 8 exclamation marks for a text that’s 2711 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

coderabbitai · 2024-06-13T10:31:07Z

docs/source/implementation.md

@@ -0,0 +1,14 @@
+# How does MEDS-Tab Work?
+
+#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?


Adjust the heading level to increment by one from the previous level for better readability and structure.

-#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
+### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?

Committable suggestion

Suggested change

#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?

### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?

Tools

Markdownlint

3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time
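To make the MD001 rule concrete, here is a minimal sketch of a heading-increment check. It is a simplified illustration under stated assumptions (ATX `#` headings only, triple-backtick fences skipped), not markdownlint's implementation, and the function name is hypothetical.

```python
import re

FENCE = "`" * 3  # built indirectly so the example can itself sit inside a fence
HEADING = re.compile(r"^(#{1,6})\s+\S")


def heading_increment_violations(markdown: str) -> list[tuple[int, int, int]]:
    """Return (line_number, expected_level, actual_level) for each skipped level."""
    violations = []
    previous = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' comments inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        if previous and level > previous + 1:
            violations.append((number, previous + 1, level))
        previous = level
    return violations


title = "# How does MEDS-Tab Work?\n\n"
print(heading_increment_violations(title + "#### What do you mean?\n"))  # [(3, 2, 4)]
print(heading_increment_violations(title + "### What do you mean?\n"))   # [(3, 2, 3)]
print(heading_increment_violations(title + "## What do you mean?\n"))    # []
```

Under this reading of the rule, a level-3 heading directly after the level-1 title would still be reported, which matches the "Expected: h2" in the lint output.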

coderabbitai · 2024-06-13T10:31:07Z

docs/source/installation.md

+
+**Local Install**
+
+```


Specify the language for fenced code blocks to enhance readability and syntax highlighting.

-```
+```bash

Tools

Markdownlint

31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
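The MD040 rule can be sketched in a few lines. This is a simplified illustration that assumes triple-backtick fences only and treats any fence line inside a block as its close; it is not markdownlint's implementation, and the sample document content is a placeholder rather than the actual text of `installation.md`.

```python
FENCE = "`" * 3  # built indirectly so the example can itself sit inside a fence


def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    missing = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if in_fence:
            in_fence = False  # this line closes the current block
        else:
            in_fence = True  # this line opens a new block
            if stripped == FENCE:
                missing.append(number)
    return missing


# Placeholder document: a bold pseudo-heading followed by an untagged block.
untagged = "\n".join(["**Local Install**", "", FENCE, "echo placeholder", FENCE])
tagged = untagged.replace(FENCE, FENCE + "bash", 1)

print(fences_missing_language(untagged))  # [3]
print(fences_missing_language(tagged))    # []
```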

coderabbitai · 2024-06-13T10:31:08Z

README.md


 See `tests/test_integration.py` for an example of the end-to-end pipeline being run on synthetic data. This
 script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

 For an end to end example over MIMIC-IV, see the [companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV)
 For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).

-### Core CLI Scripts Overview
+## Core CLI Scripts Overview

 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types


Please review the punctuation and grammar in the descriptions of the CLI scripts to ensure clarity and professionalism.

- **`meds-tab-describe`**: This command processes MEDS data shards...
+ **`meds-tab-describe`**: This command processes MEDS data shards...
- **`meds-tab-tabularize-static`**: Filters and processes the dataset based...
+ **`meds-tab-tabularize-static`**: Filters and processes the dataset based...
- **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard...
+ **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard...
- **`meds-tab-xgboost`**: Trains an XGBoost model using user-specific parameters.
+ **`meds-tab-xgboost`**: Trains an XGBoost model using user-specific parameters.
- **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep...
+ **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep...
- **`generate-permutations`**: Generates and prints a sorted list of all permutations...
+ **`generate-permutations`**: Generates and prints a sorted list of all permutations...

Also applies to: 72-72, 84-84, 99-99, 112-112, 125-125, 129-129

Tools

LanguageTool

[uncategorized] ~59-~59: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

coderabbitai · 2024-06-13T10:31:08Z

docs/source/overview.md

+
+### Additional CLI Scripts
+
+1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.


Address markdown linting issues by specifying a language for fenced code blocks.

- ```bash
+ ```bash

Tools

LanguageTool

[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

coderabbitai · 2024-06-13T10:31:08Z

docs/source/overview.md

+
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+
+   **Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)


Correct the grammar to improve readability.

- Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

Committable suggestion was skipped due to low confidence.

Tools

LanguageTool

[grammar] ~66-~66: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...
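
The alignment step described in the quoted text (matching each label to the nearest prior event) can be sketched as follows. This is a minimal standard-library illustration, not the `meds-tab-cache-task` implementation. It assumes that an event at exactly the label timestamp counts as "prior" and that labels with no earlier event are dropped; `align_labels` is a hypothetical name.

```python
from bisect import bisect_right
from datetime import datetime


def align_labels(event_times, labels):
    """Map each (patient_id, timestamp, label) row to the nearest prior event.

    event_times: dict of patient_id -> sorted list of event datetimes.
    labels: iterable of (patient_id, timestamp, label) tuples.
    Returns a list of (patient_id, event_time, label) tuples. Labels with no
    event at or before their timestamp are dropped (an assumption).
    """
    aligned = []
    for patient_id, timestamp, label in labels:
        times = event_times.get(patient_id, [])
        # Number of events at or before the label timestamp.
        idx = bisect_right(times, timestamp)
        if idx == 0:
            continue  # no prior event to attach this label to
        aligned.append((patient_id, times[idx - 1], label))
    return aligned


if __name__ == "__main__":
    events = {"p1": [datetime(2024, 1, 1), datetime(2024, 1, 5)]}
    task_labels = [("p1", datetime(2024, 1, 3), 1)]
    print(align_labels(events, task_labels))
```

The binary search keeps each lookup logarithmic in the number of events per patient, which matters when labels are aligned against long event histories.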

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2f72bb8 and b0365a6.

Files selected for processing (1)

docs/requirements.txt (1 hunks)

Files skipped from review as they are similar to previous changes (1)

docs/requirements.txt

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b0365a6 and b2655a6.

Files selected for processing (1)

.readthedocs.yaml (1 hunks)

Files skipped from review due to trivial changes (1)

.readthedocs.yaml

coderabbitai

Actionable comments posted: 8

Outside diff range and nitpick comments (1)

README.md (1)
Line range hint 137-137: Add a comma after "for example" for grammatical correctness.
- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:
Tools

LanguageTool

[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

Markdownlint

47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b2655a6 and 85bfd5e.

Files selected for processing (3)

.readthedocs.yaml (1 hunks)
README.md (6 hunks)
docs/source/index.md (1 hunks)

Files skipped from review as they are similar to previous changes (1)

.readthedocs.yaml

Additional context used

LanguageTool

docs/source/index.md

[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

README.md

[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~107-~107: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~135-~135: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...

[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

[style] ~294-~294: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

[style] ~334-~334: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

Markdownlint

README.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

185-185: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

309-309: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

400-400: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

520-520: null (MD038, no-space-in-code)
Spaces inside code span elements

31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

215-215: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

276-276: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

414-414: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

484-484: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (4)

docs/source/index.md (3)

3-3: Ensure the introduction is engaging and clearly states the purpose of MEDS-Tab.

5-15: The table of contents is well-structured and provides a clear path for navigating the documentation.

19-21: This section effectively highlights the benefits of using MEDS-Tab.

README.md (1)

43-45: Ensure consistency in documentation.

The installation instructions are duplicated here, which might confuse users. Consider removing this section or merging it into the primary installation instructions at the beginning of the document.

coderabbitai · 2024-06-13T11:57:42Z

docs/source/index.md

+- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
+- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+


The bullet points here succinctly communicate the advantages of MEDS-Tab. Consider adding specific metrics or examples to further bolster these claims.

- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions

- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.

- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions

- Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.

coderabbitai · 2024-06-13T11:57:42Z

README.md

+1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.


Rephrase the description for clarity and conciseness.

- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.

Committable suggestion

Suggested change

1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.

2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.

1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.

2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.

coderabbitai · 2024-06-13T11:57:43Z

README.md

@@ -98,9 +106,9 @@

 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

-   **Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+   **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)


Correct grammatical error in the description.

- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+ **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).

Committable suggestion

Suggested change

**Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)

**Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).

Tools

LanguageTool

[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

coderabbitai · 2024-06-13T11:57:43Z

README.md

 2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

-   **Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+   **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d,  365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`


Clarify and simplify the example description.

- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

Committable suggestion

Suggested change

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

**Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

Tools

LanguageTool

[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...
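
To make the quoted description concrete, here is a naive sketch of how one row per (`patient_id`, `timestamp`) could be built for a single patient with `code/count` and `value/sum` aggregations over a look-back window. It is an illustration under stated assumptions, not the MEDS-Tab implementation: the window is taken to be the half-open interval `(timestamp - window, timestamp]`, `None` stands in for the `full` window, and `window_aggregates` is a hypothetical name.

```python
from datetime import datetime, timedelta


def window_aggregates(events, window):
    """Build one aggregated row per distinct timestamp for ONE patient.

    events: list of (timestamp, code, value) tuples; value may be None.
    window: timedelta look-back, or None for the full history.
    Returns a list of (timestamp, {code: count}, {code: value_sum}) rows.
    """
    rows = []
    for ts in sorted({t for t, _, _ in events}):
        start = None if window is None else ts - window
        counts, sums = {}, {}
        for t, code, value in events:
            # Keep only events inside (start, ts]; future events are excluded.
            if t > ts or (start is not None and t <= start):
                continue
            counts[code] = counts.get(code, 0) + 1  # code/count
            if value is not None:
                sums[code] = sums.get(code, 0.0) + value  # value/sum
        rows.append((ts, counts, sums))
    return rows


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    patient_events = [
        (t0, "HR", 80.0),
        (t0 + timedelta(days=2), "HR", 90.0),
        (t0 + timedelta(days=2), "DX", None),
    ]
    for row in window_aggregates(patient_events, timedelta(days=1)):
        print(row)
```

This version rescans all events for every timestamp, so it is quadratic in the number of events. It is meant to show the row layout, not to be efficient.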

coderabbitai

Actionable comments posted: 14

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 85bfd5e and f6e229a.

Files selected for processing (4)

docs/source/implementation.md (1 hunks)
docs/source/overview.md (1 hunks)
docs/source/prediction.md (1 hunks)
docs/source/profiling.md (1 hunks)

Additional context used

LanguageTool

docs/source/implementation.md

[typographical] ~3-~3: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

[uncategorized] ~9-~9: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...

[style] ~66-~66: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

[uncategorized] ~74-~74: A determiner appears to be missing. Consider inserting it. (AI_EN_LECTOR_MISSING_DETERMINER)
Context: ...ow sizes. 5. Output Storage: - Sparse array is converted to Coordinate List f...

[style] ~95-~95: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

docs/source/profiling.md

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

[uncategorized] ~26-~26: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...

[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~85-~85: You might be missing the article “the” here. (AI_EN_LECTOR_MISSING_DETERMINER_THE)
Context: ...izesandaggscan be generated usinggenerate-permutations` command (See the ...

[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[uncategorized] ~102-~102: The preposition “on” seems more likely in this position than the preposition “in”. (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
Context: ...rectly call generate-permutations in the command line: ```bash genera...

[uncategorized] ~125-~125: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

Markdownlint

docs/source/implementation.md

16-16: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

77-77: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

docs/source/profiling.md

90-90: null (MD047, single-trailing-newline)
Files should end with a single newline character

docs/source/overview.md

150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/prediction.md

121-121: null (MD038, no-space-in-code)
Spaces inside code span elements

15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

85-85: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

coderabbitai · 2024-06-13T12:05:06Z

docs/source/implementation.md

+
+This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+
+```


Specify language for fenced code blocks to adhere to Markdown best practices.

- ``` + ```plaintext

Also applies to: 77-77

Tools

Markdownlint

16-16: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

coderabbitai · 2024-06-13T12:05:06Z

docs/source/implementation.md

+
+1. Describe codes (compute feature frequencies)
+2. Given time series data tabularize it
+3. cache task specific rows of data for efficient loading


Hyphenate "task-specific" for grammatical correctness.

- 3. cache task specific rows of data for efficient loading
+ 3. cache task-specific rows of data for efficient loading

Committable suggestion

Suggested change

3. cache task specific rows of data for efficient loading

3. cache task-specific rows of data for efficient loading

Tools

LanguageTool

[uncategorized] ~9-~9: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...

coderabbitai · 2024-06-13T12:05:07Z

docs/source/implementation.md

+
+5. **Output Storage**:
+
+   - Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.


Insert "a" before "Sparse array" to correct the determiner omission.

- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.

Committable suggestion

Suggested change

- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.

- A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.

Tools

LanguageTool

[uncategorized] ~74-~74: A determiner appears to be missing. Consider inserting it. (AI_EN_LECTOR_MISSING_DETERMINER)
Context: ...ow sizes. 5. Output Storage: - Sparse array is converted to Coordinate List f...
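
As background for the Coordinate List (COO) format mentioned in the quoted line, the sketch below shows the idea in plain Python: only the non-zero entries are kept, as parallel lists of row indices, column indices, and values, plus the matrix shape. The helper names are hypothetical. In practice, `scipy.sparse.coo_matrix` together with `scipy.sparse.save_npz` provides this representation and `.npz` serialization; whether MEDS-Tab uses exactly those calls is not stated in this thread.

```python
def to_coo(dense):
    """Convert a dense list-of-lists matrix to COO parts (rows, cols, data, shape)."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:  # store only the non-zero entries
                rows.append(i)
                cols.append(j)
                data.append(value)
    shape = (len(dense), len(dense[0]) if dense else 0)
    return rows, cols, data, shape


def from_coo(rows, cols, data, shape):
    """Rebuild the dense matrix from its COO parts."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for i, j, value in zip(rows, cols, data):
        dense[i][j] = value
    return dense


if __name__ == "__main__":
    matrix = [[0, 0, 3], [4, 0, 0]]
    print(to_coo(matrix))
```

For tabularized features, where most entries are zero, storing three short arrays in place of the full matrix is what keeps the on-disk size small.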

coderabbitai · 2024-06-13T12:05:07Z

docs/source/profiling.md

+
+## MEDS-Tab Tabularization Technique
+
+Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.


Consider adding a comma after "method" for better readability.

- ...more memory efficient version of their method which we denote `catabra-mem`.
+ ...more memory efficient version of their method, which we denote `catabra-mem`.

Committable suggestion

Suggested change

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.

Tools

LanguageTool

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

coderabbitai · 2024-06-13T12:05:07Z

docs/source/profiling.md

+| --------- | ----------- | ----------- | ----------- | -------- |
+| 0m15.867s | 1,410.79 MB | 3,539.32 MB | 442 MB      | meds_tab |
+
+______________________________________________________________________


Ensure the file ends with a single newline character.

+ \n

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

______________________________________________________________________

______________________________________________________________________

Tools

Markdownlint

90-90: null (MD047, single-trailing-newline)
Files should end with a single newline character
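
For readers who want to apply the MD047 fix by script rather than by hand, here is a minimal sketch in Python. The helper name `normalize_trailing_newline` is hypothetical and not part of MEDS-Tab or Markdownlint; it assumes the file is read as text and rewritten by the caller.

```python
def normalize_trailing_newline(text: str) -> str:
    """Return `text` ending in exactly one newline character.

    Hypothetical helper for illustration only. Empty input is returned
    unchanged so that empty files stay empty.
    """
    if not text:
        return text
    # Drop every trailing CR/LF, then append a single LF.
    return text.rstrip("\r\n") + "\n"
```

Applied to the contents of `docs/source/profiling.md`, this would leave the body untouched and change only how the file ends.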

coderabbitai · 2024-06-13T12:05:08Z

docs/source/overview.md

+3. We need to ensure full and seamless compatibility with the ACES CLI tool, rather than relying on the python API and manual adjustments:
+   [#34](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/34)
+
+# How does MEDS-Tab Work?


Avoid multiple top-level headings in the same document to adhere to Markdown best practices.

- # How does MEDS-Tab Work?
+ ## How does MEDS-Tab Work?

Suggested change

# How does MEDS-Tab Work?

## How does MEDS-Tab Work?

Tools

Markdownlint

150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document
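
A rough way to locate the headings that MD025 complains about is sketched below in Python. The function name `top_level_headings` is an assumption made for illustration, and the fence handling is deliberately simplified (it ignores fence length), so treat it as an approximation of what Markdownlint does rather than a reimplementation.

```python
import re

# A fence line: up to three leading spaces, then at least three backticks or tildes.
FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")
# An ATX level-1 heading: a single '#', whitespace, then some text.
H1 = re.compile(r"^#\s+\S")


def top_level_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, text) for every level-1 heading outside code fences."""
    found = []
    fence_marker = None  # '`' or '~' while inside a fenced block, else None
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if match:
            marker = match.group(1)[0]
            if fence_marker is None:
                fence_marker = marker  # opening fence
            elif marker == fence_marker:
                fence_marker = None  # closing fence
            continue
        if fence_marker is None and H1.match(line):
            found.append((lineno, line.strip()))
    return found
```

If the function returns more than one entry for a file, every entry after the first is a candidate for demotion to `##`.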

coderabbitai · 2024-06-13T12:05:08Z

docs/source/prediction.md

+
+A single XGBoost run was completed to profile time and memory usage. This was done for each `$TASK` using the following command:
+
+```


Specify language for fenced code blocks to adhere to Markdown best practices.

- ``` + ```bash

Also applies to: 85-85

Tools

Markdownlint

15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
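
The lines flagged by MD040 can be found with a short scan such as the Python sketch below. `unlabeled_fences` is a hypothetical name, and the scan assumes fences are not nested and that every opening fence has a matching closing fence.

```python
import re

# A fence line and whatever follows the fence characters (the info string).
FENCE_LINE = re.compile(r"^\s{0,3}(`{3,}|~{3,})(.*)$")


def unlabeled_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language."""
    missing = []
    inside = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE_LINE.match(line)
        if not match:
            continue
        if inside:
            inside = False  # this fence line closes the current block
        else:
            inside = True  # this fence line opens a block
            if not match.group(2).strip():
                missing.append(lineno)
    return missing
```

Each reported line would then get a language such as `bash`, as the comment above suggests.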

coderabbitai · 2024-06-13T12:05:08Z

docs/source/prediction.md

+| LOS in Hospital > 3 days        | Admission + 24 hr | 6m4.884s  | 7m5.025s  | 1m4.335s  | 11011.710        | 12223.449         |
+| LOS in Hospital > 3 days        | Admission + 48 hr | 6m9.587s  | 7m12.853s | 1m3.858s  | 10703.064        | 11829.742         |
+
+#### 1.1.2 MIMIC-IV Task Specific Training Cohort Size


Hyphenate "Task-Specific" for grammatical correctness.

- #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size

Suggested change

#### 1.1.2 MIMIC-IV Task Specific Training Cohort Size

#### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size

Tools

LanguageTool

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f6e229a and 0227900.

Files selected for processing (1)

docs/source/prediction.md (1 hunks)

Files not reviewed due to errors (1)

docs/source/prediction.md (no review received)

Additional context used

LanguageTool

docs/source/prediction.md

[uncategorized] ~58-~58: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

[uncategorized] ~186-~186: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 2.1.3 eICU Task Specific Training Cohort Size | Task ...

Markdownlint

docs/source/prediction.md

2-2: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

4-4: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

42-42: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

110-110: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

113-113: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

132-132: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

152-152: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces

112-112: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

113-113: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

114-114: Expected: 1; Actual: 4 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

134-134: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

185-185: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

198-198: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

199-199: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

76-76: null (MD019, no-multiple-space-atx)
Multiple spaces after hash on atx style heading

43-43: Expected: 1; Actual: 0; Above (MD022, blanks-around-headings)
Headings should be surrounded by blank lines

43-43: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines

135-135: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines

152-152: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines

153-153: Expected: 1; Actual: 0; Above (MD022, blanks-around-headings)
Headings should be surrounded by blank lines

186-186: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines

15-15: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

21-21: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

23-23: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

41-41: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

79-79: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

87-87: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

89-89: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

96-96: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

98-98: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

109-109: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines

132-132: null (MD032, blanks-around-lists)
Lists should be surrounded by blank lines

110-110: null (MD038, no-space-in-code)
Spaces inside code span elements

15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

79-79: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified
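
Most of the MD009 and MD012 findings listed above are mechanical, so they could be cleaned up in one pass. The following Python sketch is one possible approach, not a tool used by this PR: `tidy_whitespace` is a hypothetical helper that keeps exactly two trailing spaces (the Markdown hard line break that MD009 allows here), strips any other trailing whitespace, collapses runs of blank lines to one, and leaves the contents of fenced code blocks untouched.

```python
def tidy_whitespace(markdown: str) -> str:
    """Strip stray trailing spaces and collapse repeated blank lines."""
    out = []
    in_fence = False
    blank_run = 0
    for line in markdown.split("\n"):
        is_fence = line.lstrip().startswith(("```", "~~~"))
        if is_fence:
            in_fence = not in_fence
        if in_fence and not is_fence:
            out.append(line)  # code content: keep verbatim
            blank_run = 0
            continue
        trailing = len(line) - len(line.rstrip(" "))
        # Keep exactly two trailing spaces on non-blank lines (hard break).
        if trailing != 2 or not line.strip():
            line = line.rstrip(" \t")
        if line == "":
            blank_run += 1
            if blank_run > 1:
                continue  # drop the extra blank line
        else:
            blank_run = 0
        out.append(line)
    return "\n".join(out)
```

The MD022, MD031, and MD032 findings (blank lines around headings, fences, and lists) are not handled by this sketch and would still need manual edits or a formatter.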

coderabbitai

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0227900 and 1ab0405.

Files selected for processing (4)

docs/source/index.md (1 hunks)
docs/source/overview.md (1 hunks)
docs/source/prediction.md (1 hunks)
docs/source/profiling.md (1 hunks)

Files not reviewed due to errors (1)

docs/source/prediction.md (no review received)

Additional context used

LanguageTool

docs/source/index.md

[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

docs/source/profiling.md

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d that on the MIMIC-IV and eICU medical datasets we significantly outperform past method...

[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[uncategorized] ~125-~125: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 2.1.3 eICU Task Specific Training Cohort Size | Task ...

Markdownlint

docs/source/overview.md

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/prediction.md

121-121: null (MD038, no-space-in-code)
Spaces inside code span elements

15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

85-85: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (2)

docs/source/index.md (1)
25-27: The bullet points here succinctly communicate the advantages of MEDS-Tab. Consider adding specific metrics or examples to further bolster these claims.
- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.
Likely invalid or redundant comment.
docs/source/overview.md (1)
74-74: Correct the grammar to improve readability.
- Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
Likely invalid or redundant comment.

Tools

LanguageTool

[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

coderabbitai · 2024-06-18T18:02:32Z

docs/source/index.md

+MEDS-Tab is a library designed for automated tabularization, data preparation with aggregation, and windowing. Check out below for an overview of MEDS-Tab and how it could be useful in your workflows!
+
+```{toctree}
+---
+glob:
+maxdepth: 2
+---
+Overview <overview>
+Pipeline/Architecture Details <implementation>
+Computational Profile <profiling>
+Prediction Performance <prediction>
+Module API Reference <api/modules>
+```
+
+______________________________________________________________________
+
+## Why MEDS-Tab?
+
+MEDS-Tab is a comprehensive framework designed to streamline the handling, modeling, and analysis of complex medical time-series data. By leveraging automated processes, MEDS-Tab significantly reduces the computation required to generate high-quality baseline models for diverse supervised learning tasks.
+
+### I. Transform to MEDS
+
+- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
+- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+
+### II. Run MEDS-Tab
+
+- Run the MEDS-Tab Command-Line Interface tool (`MEDS-Tab-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://meds-tab--36.org.readthedocs.build/en/36/overview.html#core-cli-scripts-overview)!
+
+- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!
+
+By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!


Consider rephrasing for conciseness and clarity.

- This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
+ This approach simplifies the process and ensures high-quality, reproducible results for ...

Suggested change

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the MEDS-Tab ecosystem. This approach simplifies the process and ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!

Tools

LanguageTool

[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

coderabbitai · 2024-06-18T18:02:32Z

docs/source/profiling.md

+
+## 2. Comparative Performance Analysis
+
+The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.


Replace "very" with a more precise adjective.

- ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly.
+ ...use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients), and should be processed quickly.

Suggested change

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method `code_count` were used. We additionally use a budget of 10 minutes, as these are small numbers of patients (10, 100, and 500) that should be processed quickly. Note that `catabra-mem` is omitted from the tables, as it never completed within the 10-minute budget.

Tools

LanguageTool

[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...
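
To make the "budget of 10 minutes" in the quoted paragraph concrete, here is a minimal Python sketch of how a benchmark run could be capped at a wall-clock budget while recording time and peak memory. It is not the profiling harness used for the MEDS-Tab tables: the function `run_with_budget` and its return keys are hypothetical, it relies on the Unix-only `resource` module, and `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS.

```python
import resource
import subprocess
import sys
import time


def run_with_budget(cmd: list[str], budget_seconds: float) -> dict:
    """Run `cmd`, stopping it if it exceeds `budget_seconds` of wall time."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(cmd, timeout=budget_seconds)
        completed = proc.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising.
        completed = False
    elapsed = time.perf_counter() - start
    # Peak resident set size over all children waited for so far, so run
    # each benchmark in a fresh parent process for a per-command figure.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {"completed": completed, "wall_seconds": elapsed, "peak_rss_raw": peak}


if __name__ == "__main__":
    # Placeholder command; a real benchmark would pass the tool's own CLI call.
    print(run_with_budget([sys.executable, "-c", "pass"], budget_seconds=600))
```

A method that returns `completed: False` under this scheme would be reported as not finishing within the budget, which is how the paragraph describes `catabra-mem`.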

coderabbitai · 2024-06-18T18:02:32Z

docs/source/profiling.md

+Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMIC-IV and eICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.
+
+## 2. Comparative Performance Analysis
+
+The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.


Remove "of" after "all" for conciseness.

- ...ing the better performance of MEDS-Tab in all of the scenarios.
+ ...ing the better performance of MEDS-Tab in all the scenarios.

Suggested change

Tabularization of time-series data, as depicted above, is commonly used in several past works. To our knowledge, the only two libraries that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of its method, which we denote `catabra-mem`. Other libraries provide either only rolling-window functionality (`featuretools`) or only pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory-efficient method. On the MIMIC-IV and eICU medical datasets, we significantly outperform past methods. `catabra` and `tsfresh` could not complete within a budget of 10 minutes on as few as 10 patients' data for eICU, while our method scales to hundreds of patients with low memory usage. We present the results below.

## 2. Comparative Performance Analysis

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method `code_count` were used. We additionally use a budget of 10 minutes, as these are small numbers of patients (10, 100, and 500) that should be processed quickly. Note that `catabra-mem` is omitted from the tables, as it never completed within the 10-minute budget.

Tools

LanguageTool

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d that on the MIMIC-IV and eICU medical datasets we significantly outperform past method...

[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

… docs

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 215ab04 and 88320ea.

Files selected for processing (2)

README.md (3 hunks)
docs/source/prediction.md (1 hunks)

Files not reviewed due to errors (1)

docs/source/prediction.md (no review received)

Additional context used

LanguageTool

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... 11829.742 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

README.md

[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~107-~107: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~135-~135: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[uncategorized] ~160-~160: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...

[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

[style] ~294-~294: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

[style] ~334-~334: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

[uncategorized] ~334-~334: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...

[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

Markdownlint

docs/source/prediction.md

121-121: null (MD038, no-space-in-code)
Spaces inside code span elements

README.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

185-185: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

309-309: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

400-400: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

520-520: null (MD038, no-space-in-code)
Spaces inside code span elements

215-215: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

276-276: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

414-414: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

484-484: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (9)

README.md (9)
25-25: Installation command is clear and concise.

This line provides a straightforward command for installing the package via pip, which is crucial for user accessibility.

31-31: Local installation instructions are clear.

The instructions for cloning the repository and installing the package locally are clear, promoting easy setup for developers.

76-76: Explanation of caching feature names and frequencies is clear.

The documentation clearly explains how the meds-tab-describe script computes and caches feature names and frequencies, which is essential for understanding the data processing workflow.

40-42: Clarify the description of tabular feature construction and usage.

The description of constructing and using tabular features can be enhanced for better clarity and precision.
- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
80-80: Example command for static data tabularization needs clarification.

The example command provided for static data tabularization is detailed but could be simplified for better readability.
- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d,  365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
109-109: Grammar correction needed in example description.

The description has a grammatical error that needs correction for clarity.
- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+ **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).
Tools

LanguageTool

[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

133-133: Clarify the functionality of the generate-permutations command.

The description of the generate-permutations command can be improved for better understanding.
- 1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
+ 1. **`generate-permutations`**: Generates and prints all permutations from a comma-separated input, aiding in sweeping over combinations of window sizes and aggregations.
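
For readers unfamiliar with the sweep this comment refers to, here is a minimal Python sketch of the idea. It is not the `generate-permutations` implementation: the function name `sorted_subsets` is hypothetical, and the sketch assumes that "permutations" means every non-empty subset of the comma-separated items, which the text above does not confirm.

```python
from itertools import combinations


def sorted_subsets(comma_separated: str) -> list[list[str]]:
    """Return every non-empty subset of the comma-separated items, sorted.

    Hypothetical helper for illustration only; not part of MEDS-Tab.
    """
    # Split on commas and drop surrounding whitespace and empty entries.
    items = [item.strip() for item in comma_separated.split(",") if item.strip()]
    # Enumerate subsets of every size from 1 up to the full set.
    subsets = [
        list(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
    return sorted(subsets)


if __name__ == "__main__":
    for subset in sorted_subsets("1d,30d,365d"):
        print(subset)
```

Each printed subset could then be passed as one candidate set of window sizes (or aggregations) in a sweep.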
170-170: Simplify language for clarity.

The language used in the description can be simplified for better readability.
- We should likely decorrelate the default aggregations and/or window sizes we use prior to passing them into the models as features.
+ We should likely decorrelate the default aggregations and window sizes before using them as features in the models.
Tools

LanguageTool

[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

202-202: Insert a comma for correct punctuation.

A comma is missing in the sentence, which is necessary for correct punctuation and readability.
- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
Tools

LanguageTool

[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

coderabbitai

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 88320ea and 4a57e11.

Files selected for processing (2)

docs/source/installation.md (1 hunks)
docs/source/overview.md (1 hunks)

Additional context used

Markdownlint

docs/source/installation.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/overview.md

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

LanguageTool

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

Additional comments not posted (1)

docs/source/overview.md (1)
102-102: Add a comma after introductory phrase for clarity.

The phrase "For example" should be followed by a comma to improve readability.
- For example you can directly call **`generate-permutations`**
+ For example, you can directly call **`generate-permutations`**
Likely invalid or redundant comment.

Tools

LanguageTool

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

coderabbitai · 2024-06-24T20:46:33Z

docs/source/overview.md

+
+   This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+
+2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.


Clarify conjunction usage in sentence.

The use of "thus" in the description of meds-tab-tabularize-static might confuse readers as it suggests a cause-effect relationship not clearly stated.

- thus rows are duplicated across multiple timestamps for the same patient.
+ Consequently, rows are duplicated across multiple timestamps for the same patient.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination. Consequently, rows are duplicated across multiple timestamps for the same patient.
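
To make the row duplication described here concrete, the following is a minimal sketch in plain Python. It is not MEDS-Tab code: the record layout and the name `broadcast_static` are assumptions for illustration. Each patient's static features are copied onto every unique (`patient_id`, `timestamp`) row for that patient.

```python
from datetime import datetime


def broadcast_static(static_by_patient, events):
    """Attach each patient's static features to every (patient_id, timestamp) row.

    static_by_patient: dict mapping patient_id -> dict of static features.
    events: iterable of (patient_id, timestamp) pairs; duplicates are collapsed.
    Hypothetical helper for illustration only.
    """
    rows = []
    # One output row per unique (patient_id, timestamp) combination.
    for patient_id, timestamp in sorted(set(events)):
        row = {"patient_id": patient_id, "timestamp": timestamp}
        # The same static values are repeated on every row of the patient.
        row.update(static_by_patient.get(patient_id, {}))
        rows.append(row)
    return rows


if __name__ == "__main__":
    static = {1: {"sex/female": 1}, 2: {"sex/female": 0}}
    events = [
        (1, datetime(2024, 1, 1)),
        (1, datetime(2024, 1, 2)),
        (2, datetime(2024, 1, 1)),
    ]
    for row in broadcast_static(static, events):
        print(row)
```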

Tools

LanguageTool

[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

This reverts commit 4a57e11.

coderabbitai

Actionable comments posted: 6

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4a57e11 and ba90809.

Files selected for processing (2)

docs/source/installation.md (1 hunks)
docs/source/overview.md (1 hunks)

Additional context used

Markdownlint

docs/source/installation.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document

23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/overview.md

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

LanguageTool

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

coderabbitai · 2024-06-24T21:45:36Z

docs/source/installation.md

+# Scalable tabularization and tabular feature usage utilities over generic MEDS datasets
+
+<p align="center">
+  <a href="https://www.python.org/downloads/release/python-3100/"><img alt="Python" src="https://img.shields.io/badge/-Python_3.12+-blue?logo=python&logoColor=white"></a>
+  <a href="https://pypi.org/project/meds-tab/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.2.5-orange?logoColor=orange"></a>
+  <a href="https://hydra.cc/"><img alt="Hydra" src="https://img.shields.io/badge/Config-Hydra_1.3-89b8cd"></a>
+  <a href="https://codecov.io/gh/mmcdermott/MEDS_Tabular_AutoML"><img alt="Codecov" src="https://codecov.io/gh/mmcdermott/MEDS_Tabular_AutoML/graph/badge.svg?token=6GD05EDQ39"></a>
+  <a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/tests.yaml"><img alt="Tests" src="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/tests.yaml/badge.svg"></a>
+  <a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/code-quality-main.yaml"><img alt="Code Quality" src="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/code-quality-main.yaml/badge.svg"></a>
+  <a href='https://meds-tab.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/meds-tab/badge/?version=latest' alt='Documentation Status' /></a>
+  <a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/graphs/contributors"><img alt="Contributors" src="https://img.shields.io/github/contributors/mmcdermott/MEDS_Tabular_AutoML.svg"></a>
+  <a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/pulls"><img alt="Pull Requests" src="https://img.shields.io/badge/PRs-welcome-brightgreen.svg"></a>
+  <a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML#license"><img alt="License" src="https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray"></a>


Improve badge alignment and presentation.

Consider using a table or a more structured layout for the badges to enhance readability and visual appeal.

coderabbitai · 2024-06-24T21:45:36Z

docs/source/overview.md

+# Usage
+


Clarify the repository's purpose in the introduction.

The introduction could be expanded to provide more details about the specific capabilities and advantages of the repository.

coderabbitai · 2024-06-24T21:45:37Z

docs/source/overview.md

+## Quick Start
+
+To use MEDS-Tab, install the dependencies following commands below:
+
+**Pip Install**
+
+```console
+pip install meds-tab
+```
+
+**Local Install**
+
+```console
+# clone the git repo
+pip install .
+```
+
+## Scripts and Examples
+
+For an end to end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
+For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).
+
+See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.


Enhance the Quick Start and Scripts sections with more detailed examples.

Providing more detailed examples and explanations can help new users better understand how to use the repository effectively.

Tools

Markdownlint

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

coderabbitai · 2024-06-24T21:45:37Z

docs/source/overview.md

+
+To use MEDS-Tab, install the dependencies following commands below:
+
+**Pip Install**


Convert emphasized text to headings for clarity and consistency.

Using headings instead of bold text for section titles can improve the structure and readability of the document.

-**Pip Install**
+### Pip Install
-**Local Install**
+### Local Install

Also applies to: 18-18

Tools

Markdownlint

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

coderabbitai

Actionable comments posted: 18

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ba90809 and c6cc059.

Files selected for processing (5)

README.md (3 hunks)
docs/source/implementation.md (1 hunks)
docs/source/installation.md (1 hunks)
docs/source/profiling.md (1 hunks)
pyproject.toml (1 hunks)

Files skipped from review due to trivial changes (1)

pyproject.toml

Additional context used

Markdownlint

docs/source/installation.md

19-19: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

23-23: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

29-29: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

README.md

23-23: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

29-29: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

185-185: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

47-47: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

53-53: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

309-309: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

400-400: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

520-520: null
Spaces inside code span elements

(MD038, no-space-in-code)

215-215: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

276-276: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

414-414: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

484-484: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

LanguageTool

docs/source/implementation.md

[uncategorized] ~14-~14: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~64-~64: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

(A_LOT_OF)

[style] ~93-~93: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

docs/source/profiling.md

[uncategorized] ~20-~20: Possible missing comma found.
Context: ...w that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

README.md

[uncategorized] ~69-~69: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~78-~78: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)

[uncategorized] ~92-~92: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~107-~107: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)

[grammar] ~109-~109: Possible subject-verb agreement error detected.
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

(PLURAL_THAT_AGREEMENT)

[uncategorized] ~120-~120: Loose punctuation mark.
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~135-~135: Loose punctuation mark.
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)

[uncategorized] ~160-~160: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

[typographical] ~202-~202: It appears that a comma is missing.
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

(DURING_THAT_TIME_COMMA)

[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...

(SPECIFIC_HYPHEN)

[uncategorized] ~213-~213: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

(A_LOT_OF)

[style] ~294-~294: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

[uncategorized] ~330-~330: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)

[uncategorized] ~330-~330: Possible missing comma found.
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~334-~334: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

(EN_WEAK_ADJECTIVE)

[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

(MISSING_HYPHEN)

[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)

Additional comments not posted (10)

docs/source/installation.md (2)

3-13: Improve badge alignment and presentation.

Consider using a table or a more structured layout for the badges to enhance readability and visual appeal.

16-17: LGTM!

The description of the repository is clear and concise.

docs/source/implementation.md (4)

5-10: LGTM!

The description of the method's discrete parts is clear and concise.

36-37: LGTM!

The description of the data loading and sharding process is clear and concise.

70-72: LGTM!

The description of the output storage process is clear and concise.

100-106: LGTM!

The description of the XGBoost training process is clear and concise.

docs/source/profiling.md (3)

1-3: LGTM!

The heading and introductory paragraph are correctly formatted.

5-11: LGTM!

The description of the system comparison overview is clear and concise.

12-17: LGTM!

The description of the benchmarking tests is clear and concise.
README.md (1)
40-41: Rephrase for clarity and conciseness.

Improve the description for better readability.
- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
Likely invalid or redundant comment.

coderabbitai · 2024-07-02T18:47:08Z

README.md

@@ -22,13 +22,13 @@ To use MEDS-Tab, install the dependencies following commands below:

 **Pip Install**

-```bash
+```console


Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Pip Install**
+ ### Pip Install

- **Local Install**
+ ### Local Install

Also applies to: 31-31

coderabbitai · 2024-07-02T18:47:08Z

README.md

-### Scripts and Examples
+To use MEDS-Tab, install the dependencies following commands below:
+
+**Pip Install**


Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Pip Install**
+ ### Pip Install

- **Local Install**
+ ### Local Install

Also applies to: 53-53

Tools

Markdownlint

47-47: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

coderabbitai · 2024-07-02T18:47:08Z

README.md


-1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types
+1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:


Fix grammatical issues and improve clarity.

Address loose punctuation marks and improve descriptions for better readability.

- 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
+ 1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:

- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

- 3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`.
+ 3. **`meds-tab-tabularize-time-series`**: Aggregates patient data for each unique `patient_id` x `timestamp` using combinations of `window_size` and `aggregation`.

- 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

- 5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-permutations` command (See the section below for descriptions).
+ 5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using the `generate-permutations` command (see below for descriptions).

- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:

Also applies to: 78-80, 92-92, 107-107, 120-120, 135-137

Tools

LanguageTool

[uncategorized] ~69-~69: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)
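
The label alignment that `meds-tab-cache-task` is described as performing above (matching each label to the nearest prior event) can be pictured with a small sketch. This is not MEDS-Tab's implementation: the function name `align_labels_to_prior_events`, the plain-tuple data layout, and the choice to count an event at exactly the label time as "prior" are all assumptions made for illustration.

```python
from bisect import bisect_right


def align_labels_to_prior_events(event_times, labels):
    """Pair each (timestamp, label) with the latest event time <= timestamp.

    event_times: sorted event timestamps for one patient (hypothetical layout).
    labels: (timestamp, label) tuples for the same patient.
    Labels that have no prior event are dropped.
    """
    aligned = []
    for timestamp, label in labels:
        # Index of the first event strictly after the label time.
        idx = bisect_right(event_times, timestamp)
        if idx == 0:
            continue  # no event at or before this label
        aligned.append((event_times[idx - 1], label))
    return aligned


events = [1, 5, 9]
task_labels = [(0, 1), (5, 0), (7, 1), (12, 0)]
aligned = align_labels_to_prior_events(events, task_labels)
print(aligned)
```

Binary search keeps each lookup logarithmic in the number of events, which matters when a patient has many rows; whether the real pipeline treats ties as prior is not stated in this thread.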

coderabbitai · 2024-07-02T18:47:08Z

README.md

-## Implementation Improvements
+## The MEDS-Tab Architecture
+
+In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.


Fix grammatical issues and improve clarity.

Address missing commas, hyphenation, and wordiness for better readability.

- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization and XGBoost tuning processes.

- Given time series data tabularize it 3. cache task specific rows of data for efficient loading 4. XGBoost training
+ Given time series data tabularize it 3. cache task-specific rows of data for efficient loading 4. XGBoost training

- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:

- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.
+ Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly accelerates the rolling window aggregations on datasets with many concurrent observations.

- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.

Also applies to: 208-208, 213-213, 265-265, 294-294

Tools

LanguageTool

[typographical] ~202-~202: It appears that a comma is missing.
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

(DURING_THAT_TIME_COMMA)

coderabbitai · 2024-07-02T18:47:08Z

README.md

+
+## MEDS-Tab Tabularization Technique
+
+Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.


Fix grammatical issues and improve clarity.

Address missing commas, hyphenation, and wordiness for better readability.

- Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`.

- The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly.
+ The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients) and should be processed quickly.

- #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size

Also applies to: 334-334, 462-462

Tools

LanguageTool

[uncategorized] ~330-~330: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)

[uncategorized] ~330-~330: Possible missing comma found.
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

(AI_HYDRA_LEO_MISSING_COMMA)

coderabbitai · 2024-07-02T18:47:09Z

docs/source/installation.md

+
+To use MEDS-Tab, install the dependencies following commands below:
+
+**Pip Install**


Convert emphasized text to headings for clarity and consistency.

Using headings instead of bold text for section titles improves the structure and readability of the document.

-**Pip Install**
+### Pip Install

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

**Pip Install**

### Pip Install

Tools

Markdownlint

23-23: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)
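
For anyone applying the MD036 fix across several files, a rough sketch of the conversion follows. It is only an illustration: the helper name `bold_lines_to_headings` and the default heading level of 3 are assumptions, and the pattern deliberately ignores fenced code blocks and any exceptions markdownlint itself may make.

```python
import re

# A line that consists solely of one bold span, e.g. "**Pip Install**".
BOLD_ONLY = re.compile(r"^\*\*(?P<title>[^*]+)\*\*$")


def bold_lines_to_headings(markdown_text, level=3):
    """Rewrite bold-only lines as ATX headings of the given level."""
    prefix = "#" * level
    converted = []
    for line in markdown_text.splitlines():
        match = BOLD_ONLY.match(line.strip())
        if match:
            converted.append(f"{prefix} {match.group('title').strip()}")
        else:
            converted.append(line)  # leave every other line untouched
    return "\n".join(converted)


sample = "**Pip Install**\n\nRun the install command.\n\n**Local Install**"
converted_sample = bold_lines_to_headings(sample)
print(converted_sample)
```

Bold text that sits inside a sentence is left alone because the pattern must match the whole line.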

coderabbitai · 2024-07-02T18:47:09Z

docs/source/implementation.md

+# The MEDS-Tab Architecture
+
+In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.


Consider adding a comma for better readability.

The introductory sentence could benefit from a comma after "In this section" to improve the flow and readability.

- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.

Suggested change

# The MEDS-Tab Architecture

In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.

# The MEDS-Tab Architecture

In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.

coderabbitai · 2024-07-02T18:47:09Z

docs/source/implementation.md

+
+## 1. Describe Codes (compute feature frequencies)
+
+This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:


Consider adding a comma after 'as follows' and correcting 'pre-shareded'.

The comma clarifies the separation between the introductory clause and the detailed description that follows; 'pre-shareded' is a typo for 'pre-sharded'.

- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:

Suggested change

This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:

This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:

Tools

LanguageTool

[uncategorized] ~14-~14: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)

coderabbitai · 2024-07-02T18:47:09Z

docs/source/implementation.md

+
+**Detailed Workflow:**
+
+- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.


Consider using a stronger verb choice.

Replacing "speeds up" with "accelerates" might provide a stronger and more formal expression in the documentation.

- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.

Suggested change

- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.

- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and accelerates the training process.

Tools

LanguageTool

[style] ~93-~93: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

coderabbitai · 2024-07-02T18:47:09Z

docs/source/implementation.md

+
+3. **Event Aggregation**:
+
+   - Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.


Consider replacing "lots of" with "many" to enhance formality and clarity.

The phrase "lots of" might be considered informal. Using "many" can improve the formality and clarity of the documentation.

- ...ndow aggregations on datasets that have lots of concurrent observations.
+ ...ndow aggregations on datasets that have many concurrent observations.

Suggested change

- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.

- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have many concurrent observations.

Tools

LanguageTool

[style] ~64-~64: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

(A_LOT_OF)
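
As a side note on the passage under review, the same-date aggregation it describes can be sketched roughly as below. This is a minimal illustration, not the MEDS-Tab code: the tuple layout `(patient_id, timestamp, code)`, the helper name `aggregate_same_day_events`, and the use of per-code counts as the aggregate are all assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime


def aggregate_same_day_events(events):
    """Collapse events sharing a patient and calendar date into per-code counts."""
    grouped = defaultdict(Counter)
    for patient_id, timestamp, code in events:
        # Key on the calendar date so concurrent observations share one row.
        grouped[(patient_id, timestamp.date())][code] += 1
    return {key: dict(counts) for key, counts in grouped.items()}


events = [
    (1, datetime(2024, 1, 1, 8, 0), "HR"),
    (1, datetime(2024, 1, 1, 9, 30), "HR"),
    (1, datetime(2024, 1, 1, 9, 30), "BP"),
    (1, datetime(2024, 1, 2, 7, 0), "HR"),
]
daily = aggregate_same_day_events(events)
print(daily)
```

Four raw events become two rows here, which is the kind of reduction that would make later rolling-window passes cheaper on datasets with many concurrent observations.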

…re. Added logo

coderabbitai

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c6cc059 and d1ac6b2.

Files ignored due to path filters (3)

docs/assets/dark_purple_meds_tab.png is excluded by !**/*.png
docs/assets/light_purple_meds_tab.png is excluded by !**/*.png
docs/assets/white_meds_tab.png is excluded by !**/*.png

Files selected for processing (7)

docs/generate.sh (1 hunks)
docs/source/_static/switcher.json (1 hunks)
docs/source/conf.py (2 hunks)
docs/source/implementation.md (1 hunks)
docs/source/index.md (1 hunks)
docs/source/overview.md (1 hunks)
src/MEDS_tabular_automl/__init__.py (1 hunks)

Files skipped from review due to trivial changes (2)

docs/generate.sh
src/MEDS_tabular_automl/__init__.py

Additional context used

Biome

docs/source/_static/switcher.json

[error] 12-12: Expected an array, an object, or a literal but instead found ']'.

Expected an array, an object, or a literal here.

(parse)

LanguageTool

docs/source/index.md

[style] ~28-~28: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...

(IN_ORDER_TO_PREMIUM)

[style] ~39-~39: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

(NOT_ONLY_ALSO)

[style] ~39-~39: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 2513 characters long)
Context: ... datasets in reasonable raw formulations!

(EN_EXCESSIVE_EXCLAMATION)

docs/source/implementation.md

[style] ~92-~92: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)

[uncategorized] ~57-~57: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~72-~72: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)

[grammar] ~74-~74: Possible subject-verb agreement error detected.
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

(PLURAL_THAT_AGREEMENT)

[uncategorized] ~85-~85: Loose punctuation mark.
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~100-~100: Loose punctuation mark.
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)

[uncategorized] ~125-~125: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

Markdownlint

docs/source/overview.md

12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

18-18: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Additional comments not posted (14)

docs/source/index.md (4)
3-3: Enhance the introductory sentence for greater impact.

The current introduction could be more engaging. Consider emphasizing the unique features of MEDS-Tab early in the description.
[EXISTING_COMMENT]
- MEDS-Tab is a library designed for automated tabularization, data preparation with aggregation, and time windowing. Check out below for an overview of MEDS-Tab and how it could be useful in your workflows!
+ MEDS-Tab revolutionizes medical data handling by automating tabularization, data preparation, and time windowing. Discover how MEDS-Tab can transform your medical data workflows below!
23-24: Enhance the bullet points for clarity and specificity.

The bullet points under "Why MEDS-Tab?" could benefit from more specific examples or metrics to substantiate the claims of cost efficiency and performance.
[EXISTING_COMMENT]
- - Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- - Strong Performance: MEDS-Tab provides robustness and high performance across various datasets compared with other frameworks.
+ - Cost Efficiency: MEDS-Tab reduces operational costs by up to 50% compared to traditional data handling solutions.
+ - Strong Performance: In benchmarks, MEDS-Tab consistently outperforms other frameworks, enhancing predictive accuracy by 20%.
28-28: Streamline the wording for clarity.

The phrase "In order to" can be simplified to make the sentence more direct and concise.
[EXISTING_COMMENT]
- In order to use MEDS-Tab, you will first need to transform your raw EHR data to a MEDS format, which can be done using the following libraries:
+ To use MEDS-Tab, first transform your raw EHR data to a MEDS format using the following libraries:
Tools

LanguageTool

[style] ~28-~28: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...

(IN_ORDER_TO_PREMIUM)

39-39: Improve the sentence structure for enhanced readability.

The phrase "not only... but also" can be simplified to improve the flow of the text.
[EXISTING_COMMENT]
- This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning tasks for health projects.
+ This approach simplifies the process and ensures high-quality, reproducible results for your machine learning tasks in health projects.
Tools

LanguageTool

[style] ~39-~39: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

(NOT_ONLY_ALSO)

[style] ~39-~39: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 2513 characters long)
Context: ... datasets in reasonable raw formulations!

(EN_EXCESSIVE_EXCLAMATION)
docs/source/implementation.md (4)
1-3: Consider adding a comma for better readability.

The introductory sentence could benefit from a comma after "In this section" to improve the flow and readability.
[EXISTING_COMMENT]
- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
14-14: Consider adding a comma after 'as follows' and correcting 'pre-shareded'.

The comma clarifies the separation between the introductory clause and the detailed description that follows; 'pre-shareded' is a typo for 'pre-sharded'.
[EXISTING_COMMENT]
- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
71-71: Insert "a" before "Sparse array" to correct the determiner omission.

A determiner appears to be missing. Consider inserting it.
[EXISTING_COMMENT]
- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ A sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
92-92: Consider using a stronger verb choice.

Replacing "speeds up" with "accelerates" might provide a stronger and more formal expression in the documentation.
[EXISTING_COMMENT]
- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Tools

LanguageTool

[style] ~92-~92: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)
docs/source/overview.md (5)
1-2: Clarify the repository's purpose in the introduction.

The introduction could be expanded to provide more details about the specific capabilities and advantages of the repository.
[EXISTING_COMMENT]

12-12: Convert emphasized text to headings for clarity and consistency.

Using headings instead of bold text for section titles can improve the structure and readability of the document.
[EXISTING_COMMENT]
-**Pip Install**
+### Pip Install

-**Local Install**
+### Local Install
Tools

Markdownlint

12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

18-18: Convert emphasis to heading for clarity and consistency.

The emphasis on "Local Install" should be converted to a heading to maintain consistency and improve document structure.
[EXISTING_COMMENT]
- **Local Install**
+ ### Local Install
Tools

Markdownlint

18-18: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

43-43: Clarify conjunction usage in sentence.

The use of "thus" in the description of meds-tab-tabularize-static might confuse readers as it suggests a cause-effect relationship not clearly stated.
[EXISTING_COMMENT]
- thus rows are duplicated across multiple timestamps for the same patient.
+ Consequently, rows are duplicated across multiple timestamps for the same patient.
Tools

LanguageTool

[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)

102-102: Add a comma after introductory phrase for clarity.

The phrase "For example" should be followed by a comma to improve readability.
[EXISTING_COMMENT]
- For example you can directly call **`generate-permutations`**
+ For example, you can directly call **`generate-permutations`**
Tools

LanguageTool

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)
docs/source/conf.py (1)
24-24: Ensure version_match is correctly set.

The environment variable READTHEDOCS_VERSION might not always be set, leading to potential issues. Ensure there is a fallback mechanism.
- version_match = os.environ.get("READTHEDOCS_VERSION")
+ version_match = os.environ.get("READTHEDOCS_VERSION", "dev")
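
A slightly fuller sketch of the fallback suggested above, for local builds where Read the Docs does not set the variable. The helper name `resolve_version_match` and the `"dev"` default are illustrative assumptions; the actual `conf.py` may handle a missing value differently further down.

```python
import os


def resolve_version_match(environ=None, default="dev"):
    """Return READTHEDOCS_VERSION if present, otherwise the given default."""
    env = os.environ if environ is None else environ
    # dict.get only falls back when the key is absent; an empty string is kept.
    return env.get("READTHEDOCS_VERSION", default)


print(resolve_version_match({}))
print(resolve_version_match({"READTHEDOCS_VERSION": "latest"}))
```

Passing the environment in as a mapping keeps the helper easy to check without touching the real process environment.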

coderabbitai · 2024-07-03T02:43:22Z

docs/source/_static/switcher.json

+      "url": "https://pydata-sphinx-theme.readthedocs.io/en/stable/",
+      "preferred": true
+    },
+  ]


Fix JSON syntax error.

The trailing comma after the last object leaves the parser expecting another element before the closing bracket, which is what Biome reports. Remove that comma to fix the syntax error.

-    },
+    }

Suggested change

    },

    }

Tools

Biome

[error] 12-12: Expected an array, an object, or a literal but instead found ']'.

Expected an array, an object, or a literal here.

(parse)
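
The parse error can be reproduced outside Biome with Python's standard `json` module, which also rejects trailing commas. The sketch below uses a made-up switcher entry (the `version` value and the `example.invalid` URL are placeholders, not the contents of `switcher.json`).

```python
import json

# Same shape as the flagged file: the last object is followed by a comma.
broken = '[{"version": "dev", "url": "https://example.invalid/", "preferred": true},]'
fixed = '[{"version": "dev", "url": "https://example.invalid/", "preferred": true}]'


def parses_as_json(text):
    """Return True when the text is valid JSON, False on a decode error."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(parses_as_json(broken))
print(parses_as_json(fixed))
```

Running a check like this over `docs/source/_static/` before a docs build would surface the problem earlier than the theme's version switcher does.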

coderabbitai · 2024-07-03T02:43:23Z

docs/source/conf.py

@@ -1,5 +1,9 @@
 import os
+import shutil


Remove unused import.

The shutil module is imported but not used in the script.

-import shutil

Suggested change

import shutil

…with the latest readthedocs page

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d1ac6b2 and 0d148d7.

Files selected for processing (2)

docs/source/_static/switcher.json (1 hunks)
docs/source/conf.py (2 hunks)

Files skipped from review as they are similar to previous changes (1)

docs/source/_static/switcher.json

Additional comments not posted (32)

docs/source/conf.py (32)
4-4: Approved: Importing Path from pathlib.

The Path module is used for path operations in the script.

17-18: Approved: Updated project metadata.

The project metadata, including project, copyright, and author, has been updated appropriately.

22-22: Approved: Added json_url variable.

The json_url variable is defined for the version switcher.

24-24: Approved: Added version_match variable.

The version_match variable is defined for version handling.

25-25: Approved: Set release variable.

The release variable is set to the version of MEDS_tabular_automl.

29-41: Approved: Logic for handling version_match.

The script includes logic to handle different values of version_match.

49-49: Approved: Set language to "en".

The language for the documentation is set to English.

52-53: Approved: Defined __location__ and __src__.

The __location__ and __src__ variables are defined using Path.

58-58: Approved: Added source directory to sys.path.

The script adds the source directory to sys.path.

61-76: Approved: Defined ensure_pandoc_installed function.

The function ensure_pandoc_installed ensures Pandoc is installed.

79-96: Approved: Script for running sphinx-apidoc.

The script runs sphinx-apidoc automatically.

108-119: Approved: Added Sphinx extensions.

The list of Sphinx extensions includes several new extensions.

123-123: Approved: Set HTML theme to pydata_sphinx_theme.

The HTML theme is set to pydata_sphinx_theme.

125-129: Approved: Updated sidebar configuration.

The sidebar configuration for the HTML theme is updated.

131-131: Approved: Set nbsphinx_allow_errors to True.

The nbsphinx_allow_errors variable is set to True.

134-136: Approved: Defined and created collections_dir.

The collections_dir is defined and created if it does not exist.

138-138: Approved: Defined python_version variable.

The python_version variable is defined for intersphinx mapping.

140-153: Approved: Updated intersphinx mapping.

The intersphinx mapping includes several new entries.

162-162: Approved: Updated exclude_patterns variable.

The exclude_patterns variable is updated.

165-165: Approved: Set nb_execution_mode to "off".

The nb_execution_mode variable is set to "off".

171-182: Approved: Updated myst_enable_extensions variable.

The myst_enable_extensions variable includes several new extensions.

184-184: Approved: Set myst_update_mathjax to True.

The myst_update_mathjax variable is set to True.

187-199: Approved: Updated myst_url_schemes variable.

The myst_url_schemes variable includes several new URL schemes.

202-202: Approved: Updated source_suffix variable.

The source_suffix variable is updated.

208-208: Approved: Set master_doc to "index".

The master_doc variable is set to "index".

225-225: Approved: Set pygments_style to "tango".

The pygments_style variable is set to "tango".

235-235: Approved: Set todo_emit_warnings to True.

The todo_emit_warnings variable is set to True.

243-244: Approved: Set html_title and html_short_title.

The html_title and html_short_title variables are set.

249-287: Approved: Updated html_theme_options variable.

The html_theme_options variable includes several new options.

289-293: Approved: Defined html_context variable.

The html_context variable is defined.

328-329: Approved: Set html_show_copyright and htmlhelp_basename.

The html_show_copyright and htmlhelp_basename variables are set.

2-2: Remove unused import.

The shutil module is imported but not used in the script.
-import shutil
Likely invalid or redundant comment.

coderabbitai

Actionable comments posted: 4

Outside diff range and nitpick comments (6)

docs/source/profiling.md (5)
7-7: Add a comma for better readability.

The sentence could benefit from a comma after "comparison" to improve the flow and readability.
- Specifically, for comparison we used:
+ Specifically, for comparison, we used:
24-24: Remove "of" for conciseness.

The phrase "in all of the scenarios" can be simplified to "in all scenarios" to enhance readability and conciseness.
- ...emphasizing the better performance of MEDS-Tab in all of the scenarios.
+ ...emphasizing the better performance of MEDS-Tab in all scenarios.
Tools

LanguageTool

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

24-24: Correct hyphenation in adjectival compound.

When a number forms part of an adjectival compound, it should be hyphenated to improve readability.
- Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
20-20: Consider adding a comma for better readability.

The sentence could benefit from a comma after "datasets" to improve the flow and readability.
- ...w that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab.
+ ...w that on the MIMIC-IV and eICU medical datasets, we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab.
84-84: Ensure the file ends with a single newline character.

Files should end with a single newline character to adhere to best practices.
+ \n
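
The trailing-newline rule above can be checked mechanically. The following is a minimal sketch, not part of the PR: the helper name `ensure_single_trailing_newline` is hypothetical, and it assumes the file content has already been read into a string.

```python
def ensure_single_trailing_newline(text: str) -> str:
    """Return `text` ending in exactly one newline (empty input stays empty)."""
    # Drop every trailing newline / carriage return, then add one back.
    stripped = text.rstrip("\r\n")
    return stripped + "\n" if stripped else ""


if __name__ == "__main__":
    # Hypothetical usage: normalize the content of a documentation file.
    print(repr(ensure_single_trailing_newline("last line\n\n\n")))
```

Applying this to the content of `docs/source/profiling.md` before writing it back would likely satisfy the check, assuming no other tooling rewrites the file afterwards.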
docs/source/prediction.md (1)
85-85: Specify language for fenced code blocks to adhere to Markdown best practices.

Fenced code blocks should have a language specified.
- ```
+ ```bash

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0d148d7 and 11e4623.

Files selected for processing (2)

docs/source/prediction.md (1 hunks)
docs/source/profiling.md (1 hunks)

Additional context used

LanguageTool

docs/source/profiling.md

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)

[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

(SPECIFIC_HYPHEN)

Markdownlint

docs/source/prediction.md

121-121: null
Spaces inside code span elements

(MD038, no-space-in-code)

coderabbitai · 2024-07-03T04:29:44Z

docs/source/prediction.md

+  tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
+```
+
+Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over.


Remove spaces inside code span elements.

Spaces inside code span elements should be removed to adhere to best practices.

- ` tabularization.aggs`
+ `tabularization.aggs`

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over.

Note that the XGBoost command shown includes `tabularization.window_sizes` and `tabularization.aggs` in the parameters to sweep over.

Tools

Markdownlint

121-121: null
Spaces inside code span elements

(MD038, no-space-in-code)
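
For illustration only, the MD038 fix above can be approximated with a small regular-expression pass. This is a sketch under stated assumptions: it handles only single-backtick code spans that open and close on the same line, it is not how Markdownlint itself works, and the name `strip_code_span_padding` is hypothetical.

```python
import re

# Single-backtick code span on one line: `...` with no backticks inside.
CODE_SPAN = re.compile(r"`([^`\n]+)`")


def strip_code_span_padding(line: str) -> str:
    """Remove leading/trailing spaces inside single-backtick code spans."""

    def fix(match: re.Match) -> str:
        inner = match.group(1).strip()
        # Leave whitespace-only spans untouched rather than emitting "``".
        return f"`{inner}`" if inner else match.group(0)

    return CODE_SPAN.sub(fix, line)


if __name__ == "__main__":
    before = "includes `tabularization.window_sizes` and ` tabularization.aggs` in"
    print(strip_code_span_padding(before))
```

Fenced code blocks and multi-backtick spans are deliberately out of scope here, so a real clean-up would probably rely on the linter's own fixer.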

coderabbitai · 2024-07-03T04:29:45Z

docs/source/prediction.md

+| LOS in Hospital > 3 days        | Admission + 24 hr | 6m5s      | 7m5s      | 1m4s     | 11,012           | 12,223            |
+| LOS in Hospital > 3 days        | Admission + 48 hr | 6m10s     | 7m12s     | 1m4s     | 10,703           | 11,830            |
+
+#### 1.2 MIMIC-IV Task Specific Training Cohort Size


Hyphenate "Task-Specific" for grammatical correctness.

When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.

- #### 1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.2 MIMIC-IV Task-Specific Training Cohort Size

Committable suggestion

Suggested change

#### 1.2 MIMIC-IV Task Specific Training Cohort Size

#### 1.2 MIMIC-IV Task-Specific Training Cohort Size

Tools

LanguageTool

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)

coderabbitai · 2024-07-03T04:29:45Z

docs/source/prediction.md

+| Post-discharge 30 day Mortality | Discharge         | 0.003 | 0.0116 | 0.001 | 0.730     | 13                   | 986                       | 7                     | 7              |
+| Post-discharge 1 year Mortality | Discharge         | 0.005 | 0.006  | 0.002 | 0.690     | 93                   | 938                       | 6                     | 14             |
+
+#### 3. eICU Task Specific Training Cohort Size


Hyphenate "Task-Specific" for grammatical correctness.

When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.

- #### 3. eICU Task Specific Training Cohort Size
+ #### 3. eICU Task-Specific Training Cohort Size

Committable suggestion

Suggested change

#### 3. eICU Task Specific Training Cohort Size

#### 3. eICU Task-Specific Training Cohort Size

Tools

LanguageTool

[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

(SPECIFIC_HYPHEN)

coderabbitai

Actionable comments posted: 19

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 11e4623 and 7fe635a.

Files selected for processing (2)

README.md (2 hunks)
docs/source/profiling.md (1 hunks)

Files not summarized due to errors (1)

README.md: Error: Message exceeds token limit

Additional context used

LanguageTool

docs/source/profiling.md

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

README.md

[style] ~61-~61: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...

(IN_ORDER_TO_PREMIUM)

[style] ~72-~72: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

(NOT_ONLY_ALSO)

[uncategorized] ~76-~76: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~85-~85: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~85-~85: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)

[uncategorized] ~99-~99: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~114-~114: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)

[grammar] ~116-~116: Possible subject-verb agreement error detected.
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

(PLURAL_THAT_AGREEMENT)

[uncategorized] ~127-~127: Loose punctuation mark.
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~142-~142: Loose punctuation mark.
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

(UNLIKELY_OPENING_PUNCTUATION)

[typographical] ~144-~144: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)

[uncategorized] ~167-~167: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~177-~177: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

[uncategorized] ~220-~220: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~298-~298: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

[uncategorized] ~334-~334: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~338-~338: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~338-~338: Possible missing comma found.
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...

(AI_HYDRA_LEO_MISSING_COMMA)

[uncategorized] ~338-~338: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)

[uncategorized] ~594-~594: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

(SPECIFIC_HYPHEN)

Markdownlint

README.md

21-21: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

32-32: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

38-38: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

207-207: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

315-315: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

400-400: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

520-520: null
Spaces inside code span elements

(MD038, no-space-in-code)

Additional comments not posted (4)

docs/source/profiling.md (2)
20-20: Correct the spelling error.

The word "depecited" should be corrected to "depicted."
- Tabularization of time-series data, as depecited above, is commonly used in several past works.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works.
Likely invalid or redundant comment.

24-24: Remove "of" after "all" for conciseness.

The phrase "in all of the scenarios" can be simplified to "in all scenarios" to enhance readability and conciseness.
- ...ing the better performance of MEDS-Tab in all of the scenarios.
+ ...ing the better performance of MEDS-Tab in all scenarios.
Likely invalid or redundant comment.

Tools

LanguageTool

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)
README.md (2)

52-58: LGTM!

The "Why MEDS-Tab?" section is clear and informative.

400-400: LGTM!

The "Prediction Performance" section is clear and informative.

Tools

Markdownlint

400-400: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)

coderabbitai · 2024-07-03T04:46:07Z

docs/source/profiling.md

+
+### MEDS-Tab Tabularization Technique
+
+Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.


Add a comma for better readability.

Consider adding a comma after "method" for better readability.

- ...more memory efficient version of their method which we denote `catabra-mem`.
+ ...more memory efficient version of their method, which we denote `catabra-mem`.

Committable suggestion

Suggested change

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

coderabbitai · 2024-07-03T04:46:08Z

docs/source/profiling.md

+
+## 2. Comparative Performance Analysis
+
+The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.


Correct hyphenation in adjectival compound.

When a number forms part of an adjectival compound, it should be hyphenated to improve readability.

- Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.

Committable suggestion

Suggested change

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.

Tools

LanguageTool

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

coderabbitai · 2024-07-03T04:46:08Z

README.md

+1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backwards in time from a given index date.
+2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.
+
+## Quick Start


Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Pip Install**
+ ### Pip Install

Committable suggestion was skipped due to low confidence.

coderabbitai · 2024-07-03T04:46:08Z

README.md


 To use MEDS-Tab, install the dependencies following commands below:

 **Pip Install**

-```bash
+```console
 pip install meds-tab
 ```

 **Local Install**


Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Local Install**
+ ### Local Install

Committable suggestion

Suggested change

**Local Install**

### Local Install

Tools

Markdownlint

38-38: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

coderabbitai · 2024-07-03T04:46:08Z

README.md

-6. **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep to optimize over `window_sizes`, `aggregations`, and `min_code_inclusion_frequency`, aiming to enhance model performance and adaptability.
-
-### Additional CLI Scripts
+## Additional CLI Scripts

 1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.

   For example you can directly call **`generate-permutations`** in the command line:


Add a comma after 'For example'.

Improve readability by adding a comma.

- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:

Committable suggestion

Suggested change

For example you can directly call **`generate-permutations`** in the command line:

For example, you can directly call **`generate-permutations`** in the command line:

Tools

LanguageTool

[typographical] ~144-~144: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)
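
As an aside on the `generate-permutations` description quoted above: the README text does not spell out the exact output, so the following is only a sketch of one plausible reading. It assumes "permutations" means every non-empty, order-insensitive subset of the comma-separated input (which is what a sweep over combinations of window sizes or aggregations would need). The function name `all_nonempty_subsets` is hypothetical and is not the script's actual implementation.

```python
from itertools import combinations


def all_nonempty_subsets(csv_items: str) -> list[list[str]]:
    """Enumerate every non-empty subset of a comma-separated list of items.

    Assumption: order within a subset does not matter, so subsets are built
    from the sorted input and each appears exactly once.
    """
    items = sorted(item.strip() for item in csv_items.split(",") if item.strip())
    return [
        list(subset)
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


if __name__ == "__main__":
    # Hypothetical window sizes; the real CLI's formatting may differ.
    for subset in all_nonempty_subsets("1d,7d,30d"):
        print(subset)
```

For n input items this yields 2^n - 1 subsets, which is worth keeping in mind when sizing a sweep over window sizes and aggregations.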

coderabbitai · 2024-07-03T04:46:09Z

README.md

+
+**Detailed Workflow:**
+
+- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.


Use a stronger verb for clarity.

Replace "speeds up" with "accelerates" for stronger wording.

- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.

Committable suggestion

Suggested change

- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.

- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and accelerates the training process.

Tools

LanguageTool

[style] ~298-~298: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

coderabbitai · 2024-07-03T04:46:09Z

README.md

+
+## 2. Comparative Performance Analysis
+
+The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.


Fix grammatical issues and improve clarity.

Address missing commas and improve descriptions for better readability.

- The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients), data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.

Committable suggestion

Suggested change

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients), data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.

Tools

LanguageTool

[style] ~338-~338: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)

[uncategorized] ~338-~338: Possible missing comma found.
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...

(AI_HYDRA_LEO_MISSING_COMMA)

[uncategorized] ~338-~338: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

coderabbitai · 2024-07-03T04:46:09Z

README.md

-   downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced --
-   what is more advanced is the efficient construction, storage, and loading of tabular features for the
-   candidate AutoML models, enabling a far more extensive search over different featurization strategies.
+See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.


Clarify the example description.

Improve the description for better readability.

- See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
+ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline run on synthetic data. This script is a functional test executed with `pytest` to verify the algorithm's correctness.

Committable suggestion

Suggested change

See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline run on synthetic data. This script is a functional test executed with `pytest` to verify the algorithm's correctness.

coderabbitai · 2024-07-03T04:46:09Z

README.md

+## Core CLI Scripts Overview
+
+1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:

   - time-series codes (codes with timestamps)
   - time-series numerical values (codes with timestamps and numerical values)
   - static codes (codes without timestamps)
   - static numerical codes (codes without timestamps but with numerical values).

-   **Caching feature names and frequencies** in a dataset stored in `"path_to_data"`
-
-   ```
-   meds-tab-describe MEDS_cohort_dir="path_to_data"
-   ```
+   This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.


Fix grammatical issues and improve clarity.

Address loose punctuation marks and improve descriptions for better readability.

- 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
+ 1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:

Committable suggestion


Suggested change

## Core CLI Scripts Overview

1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:

- time-series codes (codes with timestamps)

- time-series numerical values (codes with timestamps and numerical values)

- static codes (codes without timestamps)

- static numerical codes (codes without timestamps but with numerical values).

**Caching feature names and frequencies** in a dataset stored in `"path_to_data"`

```
meds-tab-describe MEDS_cohort_dir="path_to_data"
```

This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.

## Core CLI Scripts Overview

1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:

- time-series codes (codes with timestamps)

- time-series numerical values (codes with timestamps and numerical values)

- static codes (codes without timestamps)

- static numerical codes (codes without timestamps but with numerical values).

This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
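The four code-types listed above can be illustrated with a minimal sketch. This is not the `meds-tab-describe` implementation; the field names `code`, `timestamp`, and `numerical_value`, and both helper functions, are assumptions made for illustration only.

```python
from collections import Counter


def code_category(event):
    """Classify one event dict into one of the four code-types (hypothetical helper)."""
    has_time = event.get("timestamp") is not None
    has_value = event.get("numerical_value") is not None
    if has_time:
        return "time-series numerical value" if has_value else "time-series code"
    return "static numerical code" if has_value else "static code"


def code_frequencies(events):
    """Count how often each (code, category) pair occurs across the events."""
    return Counter((e["code"], code_category(e)) for e in events)


# Assumed toy data: two timestamped heart-rate values and one static code.
events = [
    {"code": "HR", "timestamp": "2024-01-01T08:00", "numerical_value": 72.0},
    {"code": "HR", "timestamp": "2024-01-01T09:00", "numerical_value": 75.0},
    {"code": "SEX//F", "timestamp": None, "numerical_value": None},
]
frequencies = code_frequencies(events)
```

A real pipeline would compute these counts per shard and then persist them (the text above mentions a `code_metadata.parquet` file); the sketch only shows the categorization logic.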

Tools

LanguageTool

[uncategorized] ~76-~76: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)

coderabbitai · 2024-07-03T04:46:09Z

README.md

+
+### MEDS-Tab Tabularization Technique
+
+Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.


Fix grammatical issues and improve clarity.

Address missing commas and improve descriptions for better readability.

- Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`.

Committable suggestion


Suggested change

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

Tools

LanguageTool

[uncategorized] ~334-~334: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)

mmcdermott and others added 6 commits June 13, 2024 03:31

Removed auto generated API files

fa228da

Small README modifications

27cba9b

Use relative imports inside the (now nested) scripts files.

adf93b0

Fixed unexpected first-element-only issue with the readthedocs API. F…

e39b934

…ormat is still very off though.

checkpointing rst

aeadfde

fixed docs

2f72bb8

updated requirements

b0365a6

coderabbitai bot reviewed Jun 13, 2024


Add PDF format

b2655a6

coderabbitai bot reviewed Jun 13, 2024


mmcdermott and others added 8 commits June 13, 2024 06:58

trying a different readthedocs config for dependency management

0bcde72

Update index.md

1fe9186

Added appendix A to README

14bad2b

Added B

5f75c2f

Added C

0daab40

Added D

7929e5a

Fixed spelling issues

85bfd5e

added docs

f6e229a

coderabbitai bot reviewed Jun 13, 2024


added eicu docs

0227900

coderabbitai bot reviewed Jun 13, 2024


Nassim Oufattole and others added 3 commits June 13, 2024 12:12

added appendix b

16325b2

fixed overview being two chapters

3e81f60

beginning small updates to docs

1ab0405

coderabbitai bot reviewed Jun 18, 2024


index md edit

46fc6aa

Merge branch 'docs' of github.com:mmcdermott/MEDS_Tabular_AutoML into…

88320ea

… docs

coderabbitai bot reviewed Jun 24, 2024


bash to console

4a57e11

coderabbitai bot reviewed Jun 24, 2024


aleksiakolo added 4 commits June 24, 2024 17:08

Revert "bash to console"

a374b26

This reverts commit 4a57e11.

overview bash to console

72dc060

last bash to console

6eddea4

code quality fixed ?

ba90809

coderabbitai bot reviewed Jun 24, 2024


aleksiakolo and others added 2 commits June 24, 2024 17:59

minor structure changes

f3d615f

Update pyproject.toml

c6cc059

coderabbitai bot reviewed Jul 2, 2024


Cleaned up docs formating. Added versioning. Fixed typos. Added featu…

d1ac6b2

…re. Added logo

coderabbitai bot reviewed Jul 3, 2024


Nassim Oufattole added 4 commits July 3, 2024 03:00

updated versions

c82703b

updated the switcher json file which had a bug

1da8e6d

updated dev version switcher for config so dev version is associated …

40e85ce

…with the latest readthedocs page

updated switcher json to point to the latest read the docs page

0d148d7

coderabbitai bot reviewed Jul 3, 2024


cleaned up tables

11e4623

coderabbitai bot reviewed Jul 3, 2024


Nassim Oufattole added 2 commits July 3, 2024 04:30

updated the readme

a7cb194

cleaned up tables

7fe635a

coderabbitai bot reviewed Jul 3, 2024


Oufattole merged commit 9f4dde8 into main Jul 3, 2024
3 checks passed

mmcdermott deleted the docs branch August 10, 2024 17:27

coderabbitai bot mentioned this pull request Sep 8, 2024

added autogluon support, more models, more preprocessing strategies #81

Merged

This was referenced Oct 22, 2024

Documentation Overhaul for meds-tab #96

Merged

Adding updated documentation #97

Merged

coderabbitai bot mentioned this pull request Nov 6, 2024

Dev #100

Merged


		- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

		By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!

		@@ -0,0 +1,14 @@
		# How does MEDS-Tab Work?

		#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?

	#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
	### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?


		### Additional CLI Scripts

		1. `generate-permutations`: Generates and prints a sorted list of all permutations from a comma-separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
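As a rough illustration of what such a sweep helper might produce, the sketch below enumerates every non-empty subset of a comma-separated list. It assumes that "permutations" here means unordered combinations of options (order does not matter when choosing window sizes to sweep over); the function name `generate_subsets` is hypothetical and is not the script's actual entry point.

```python
from itertools import combinations


def generate_subsets(csv_options):
    """Return a sorted list of all non-empty subsets of a comma-separated string."""
    options = [opt.strip() for opt in csv_options.split(",") if opt.strip()]
    subsets = []
    for size in range(1, len(options) + 1):
        subsets.extend(combinations(options, size))
    return sorted(subsets)


# Assumed example input: two window sizes.
window_subsets = generate_subsets("1d,7d")
```

With `n` options this yields `2**n - 1` subsets, so the sweep grows quickly as options are added.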


		4. `meds-tab-cache-task`: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

		Aligh tabularized data for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
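The alignment of labels to the nearest prior event can be sketched with a binary search over each patient's sorted event timestamps. This is only an illustration under stated assumptions: the function `align_labels`, its input shapes, and the choice to treat an event at exactly the label timestamp as "prior" are not taken from the MEDS-Tab source.

```python
from bisect import bisect_right


def align_labels(event_times, labels):
    """Attach each label to the latest event at or before the label timestamp.

    event_times: dict mapping patient_id -> sorted list of event timestamps.
    labels: iterable of (patient_id, timestamp, label) tuples.
    Labels with no prior event are dropped.
    """
    aligned = []
    for patient_id, timestamp, label in labels:
        times = event_times.get(patient_id, [])
        idx = bisect_right(times, timestamp)  # count of events at or before timestamp
        if idx == 0:
            continue  # nothing to align to
        aligned.append((patient_id, times[idx - 1], label))
    return aligned


# Assumed toy data with integer timestamps for brevity.
event_times = {1: [10, 20, 30], 2: [5]}
labels = [(1, 25, 1), (1, 5, 0), (2, 5, 1)]
aligned = align_labels(event_times, labels)
```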

		- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions.
		- Strong Performance: MEDS-Tab delivers robust, high performance across various datasets compared with other frameworks.

		1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
		2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.
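To make the idea of a look-back summary feature concrete, here is a minimal sketch that counts how often each code occurs within a window ending at an index date. The function `window_code_counts`, the `(timestamp, code)` tuple layout, and the half-open window convention are assumptions for illustration; they do not describe the library's internals.

```python
from collections import Counter
from datetime import datetime, timedelta


def window_code_counts(events, index_date, window):
    """Count codes with timestamps in (index_date - window, index_date]."""
    start = index_date - window
    return Counter(code for ts, code in events if start < ts <= index_date)


# Assumed toy record: one event inside a 7-day window, one too old, one in the future.
index_date = datetime(2024, 6, 1)
events = [
    (datetime(2024, 5, 30), "A"),
    (datetime(2024, 1, 1), "A"),
    (datetime(2024, 6, 2), "B"),
]
counts_7d = window_code_counts(events, index_date, timedelta(days=7))
counts_1y = window_code_counts(events, index_date, timedelta(days=365))
```

Each (window size, aggregation, code) triple becomes one column of the flat feature table, which is why the feature space grows quickly and sparse storage matters.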

	Example: Aligh tabularized data for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
	Example: Align Tabularized Data: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).


		This initial stage processes a pre-sharded dataset. We expect the following structure, where each shard contains a subset of the patients:

		```

	3. cache task specific rows of data for efficient loading
	3. cache task-specific rows of data for efficient loading


		5. Output Storage:

		- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.

	- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
	- A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
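For readers unfamiliar with Coordinate List (COO) format, the sketch below converts a small dense matrix into the three parallel lists that COO stores. It is a pure-Python illustration with a hypothetical helper name; in practice a library such as SciPy (`scipy.sparse.coo_matrix` together with `scipy.sparse.save_npz`) would typically handle both the conversion and the `.npz` serialization, and whether MEDS-Tab uses exactly those calls is not stated here.

```python
def to_coo(dense_rows):
    """Return (row_indices, col_indices, values) for the non-zero entries."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense_rows):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


# Assumed toy feature matrix: 2 rows, 3 feature columns, mostly zeros.
dense = [
    [0, 3, 0],
    [1, 0, 0],
]
rows, cols, data = to_coo(dense)
```

Only the non-zero entries are stored, so the on-disk size scales with the number of observed features rather than with rows times columns.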


		## MEDS-Tab Tabularization Technique

		Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.

	______________________________________________________________________
	______________________________________________________________________


		A single XGBoost run was completed to profile time and memory usage. This was done for each `$TASK` using the following command:

		```

	#### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
	#### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size


		## 2. Comparative Performance Analysis

		The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method `code_count` were used. We additionally use a budget of 10 minutes, as these are very small numbers of patients (10, 100, and 500 patients) and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.


		This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.

		2. `meds-tab-tabularize-static`: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.


		To use MEDS-Tab, install the dependencies using the commands below:

		Pip Install


		1. `meds-tab-describe`: This command processes MEDS data shards to compute the frequencies of different code-types
		1. `meds-tab-describe`: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:


		# The MEDS-Tab Architecture

		In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.


		## 1. Describe Codes (compute feature frequencies)

		This initial stage processes a pre-sharded dataset. We expect the following structure, where each shard contains a subset of the patients:


		Detailed Workflow:

		- Row Selection Based on Tasks: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.


		3. Event Aggregation:

		- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.
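The same-day aggregation step can be sketched as a group-by on patient and calendar date. The helper `aggregate_same_day` and the `(patient_id, timestamp, code)` tuple layout are assumptions for illustration, and counting is used here as a stand-in for whatever aggregation the pipeline actually applies.

```python
from collections import Counter, defaultdict
from datetime import datetime


def aggregate_same_day(events):
    """Collapse events into one Counter of codes per (patient_id, date)."""
    grouped = defaultdict(Counter)
    for patient_id, timestamp, code in events:
        grouped[(patient_id, timestamp.date())][code] += 1
    return dict(grouped)


# Assumed toy data: two observations on one day, one on the next day.
events = [
    (1, datetime(2024, 1, 1, 8), "HR"),
    (1, datetime(2024, 1, 1, 20), "HR"),
    (1, datetime(2024, 1, 2, 8), "HR"),
]
daily = aggregate_same_day(events)
```

Three raw rows become two aggregated rows here; on datasets with many concurrent observations the reduction is larger, which is what makes the later rolling-window pass cheaper.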

	Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over.
	Note that the XGBoost command shown includes `tabularization.window_sizes` and `tabularization.aggs` in the parameters to sweep over.

	#### 1.2 MIMIC-IV Task Specific Training Cohort Size
	#### 1.2 MIMIC-IV Task-Specific Training Cohort Size

	#### 3. eICU Task Specific Training Cohort Size
	#### 3. eICU Task-Specific Training Cohort Size


		### MEDS-Tab Tabularization Technique

		Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

	For example you can directly call `generate-permutations` in the command line:
	For example, you can directly call `generate-permutations` in the command line:

	See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
	See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline run on synthetic data. This script is a functional test executed with `pytest` to verify the algorithm's correctness.