Docs #36

Merged
merged 48 commits into from
Jul 3, 2024
Conversation

mmcdermott (Owner) commented Jun 13, 2024

Summary by CodeRabbit

  • New Features

    • Added detailed documentation including installation, implementation, and performance profiling of the MEDS-Tab system.
  • Documentation

    • Enhanced README with updated installation instructions and roadmap.
    • Added index.md, installation.md, implementation.md, overview.md, prediction.md, and profiling.md to provide comprehensive information on MEDS-Tab functionalities.
  • Chores

    • Updated .gitignore to exclude specific documentation directories.
    • Modified documentation dependencies and configurations in docs/requirements.txt and docs/source/conf.py.
  • Refactor

    • Improved import statements across various scripts for better modularity and maintainability.

coderabbitai bot (Contributor) commented Jun 13, 2024

Walkthrough

The latest changes primarily enhance the MEDS-Tab system's documentation, install processes, code structure, and functionalities. New documentation files explain various features, including installation, implementation, prediction, and profiling. The .gitignore and README.md files have been updated for better clarity and project management. Import paths in several Python scripts are refactored to use relative imports, improving the internal code structure.

Changes

| File/Path | Summary |
| --- | --- |
| .gitignore | Added exclusions for specific documentation directories. |
| README.md | Updated installation instructions, script descriptions, and roadmap for future improvements. |
| docs/make.bat | Added a conditional check for an argument and adjusted script flow. |
| docs/requirements.txt | Updated dependencies and added new packages like esgpt and pydata-sphinx-theme. |
| docs/source/conf.py | Extensive changes to project metadata, theme configurations, extensions, and version handling. |
| docs/source/index.md, docs/source/installation.md, docs/source/implementation.md, docs/source/overview.md, docs/source/prediction.md, docs/source/profiling.md | Added new documentation sections explaining features, usage, and comparisons of MEDS-Tab. |
| docs/generate.sh | Added a new script to clean build directory, generate HTML files, and serve them. |
| src/MEDS_...scripts/... | Refactored import statements to use relative imports for better organization. |
| src/MEDS_tabular_automl/utils.py | Modified OmegaConf.register_new_resolver to include replace=True. |
| .readthedocs.yaml | Added PDF build format option for Sphinx configuration. |
| pyproject.toml | Adjusted scipy dependency version to <1.14.0. |

Sequence Diagram(s)

No sequence diagrams were generated as the changes are primarily related to documentation, import paths, and configuration updates rather than new features or significant alterations to control flow.

Poem

In fields of code, where rabbits hop,
They polished scripts, each line, each top.
With docs now clear as morning dew,
Installation paths, they shine anew.
Imports settled, no more fight,
Our MEDS now gleam in data's light.
📚🐇✨

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

codecov-commenter commented Jun 13, 2024

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.73%. Comparing base (cfba4ae) to head (7fe635a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
+ Coverage   91.72%   91.73%   +0.01%     
==========================================
  Files          13       14       +1     
  Lines         822      823       +1     
==========================================
+ Hits          754      755       +1     
  Misses         68       68              
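
The percentages in the table can be reproduced from the hit and line counts. The sketch below is only an inference from the numbers shown: they are consistent with `hits / lines` truncated (not rounded) to two decimals. The helper name is hypothetical and nothing here describes Codecov's internals.

```python
def coverage_percent_truncated(hits, lines):
    """Return hits/lines as a percentage, truncated to two decimals.

    Integer arithmetic is used so that no floating-point rounding
    happens before the truncation step.
    """
    return (hits * 10000 // lines) / 100


# Counts copied from the coverage diff: hits + misses == lines.
base_hits, base_misses, base_lines = 754, 68, 822
head_hits, head_misses, head_lines = 755, 68, 823

base = coverage_percent_truncated(base_hits, base_lines)
head = coverage_percent_truncated(head_hits, head_lines)
delta = round(head - base, 2)

print(base, head, delta)
```

Plain rounding would instead give 91.73 and 91.74 for the same counts, which is why truncation is assumed here.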

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

Outside diff range and nitpick comments (3)
src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1)

Line range hint 94-102: Address the loop variable binding issue to prevent potential bugs in asynchronous or concurrent execution.

- for shard_fp, window_size, agg in iter_wrapper(tabularization_tasks):
+ for shard_fp, window_size, agg in iter_wrapper(list(tabularization_tasks)):
Tools
Ruff

15-15: Module level import not at top of file (E402)


16-16: Module level import not at top of file (E402)


18-18: Module level import not at top of file (E402)


19-19: Module level import not at top of file (E402)


20-20: Module level import not at top of file (E402)


21-21: Module level import not at top of file (E402)


22-22: Module level import not at top of file (E402)
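
Ruff's B023 warning for this file concerns Python's late-binding closures: a function defined inside a loop looks up the loop variable when it is called, not when it is defined. The sketch below uses hypothetical function names and illustrative aggregation strings; it is not the repository's code. It shows the behaviour and one common remedy, binding the value through a default argument. Under these assumptions, wrapping the iterable in `list()` does not change how the variable is bound.

```python
def build_tasks_late_binding(aggs):
    """Each closure reads `agg` when it is called, after the loop has ended."""
    tasks = []
    for agg in aggs:
        def compute():
            return agg  # resolved at call time
        tasks.append(compute)
    return tasks


def build_tasks_bound(aggs):
    """A default argument captures the value of `agg` for each iteration."""
    tasks = []
    for agg in aggs:
        def compute(agg=agg):  # value bound at definition time
            return agg
        tasks.append(compute)
    return tasks


aggs = ["code/count", "value/sum", "value/max"]  # illustrative names only

late = [task() for task in build_tasks_late_binding(aggs)]
bound = [task() for task in build_tasks_bound(aggs)]
late_from_list = [task() for task in build_tasks_late_binding(list(aggs))]

print(late)
print(bound)
print(late_from_list)
```

If each closure is called before the loop advances, the late binding is harmless in practice; the warning matters when calls are deferred. `functools.partial` is another way to bind the values.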

src/MEDS_tabular_automl/scripts/tabularize_static.py (1)

Line range hint 142-142: Address the loop variable binding issue to prevent potential bugs in asynchronous or concurrent execution.

- for shard_fp, agg in iter_wrapper(tabularization_tasks):
+ for shard_fp, agg in iter_wrapper(list(tabularization_tasks)):
Tools
Ruff

15-15: Module level import not at top of file (E402)


17-22: Module level import not at top of file (E402)


23-23: Module level import not at top of file (E402)


24-24: Module level import not at top of file (E402)


25-25: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/launch_xgboost.py (1)

Line range hint 90-90: Remove unnecessary True if ... else False in condition.

- code_mask = [True if idx in codes_set else False for idx in feature_ids]
+ code_mask = [idx in codes_set for idx in feature_ids]
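
A membership test already evaluates to a `bool`, so the two forms give the same mask. A minimal sketch with made-up values for `codes_set` and `feature_ids`:

```python
codes_set = {2, 5, 7}          # illustrative values, not from the repository
feature_ids = [1, 2, 3, 5, 8]

# Original form flagged by SIM210.
verbose_mask = [True if idx in codes_set else False for idx in feature_ids]

# Suggested form: `idx in codes_set` is already True or False.
concise_mask = [idx in codes_set for idx in feature_ids]

print(concise_mask)
```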
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cfba4ae and 2f72bb8.

Files ignored due to path filters (3)
  • docs/assets/main_fig.png is excluded by !**/*.png
  • docs/assets/meds_logo.png is excluded by !**/*.png
  • docs/assets/pivot.png is excluded by !**/*.png
Files selected for processing (17)
  • .gitignore (2 hunks)
  • README.md (3 hunks)
  • docs/make.bat (2 hunks)
  • docs/requirements.txt (1 hunks)
  • docs/source/conf.py (3 hunks)
  • docs/source/implementation.md (1 hunks)
  • docs/source/index.md (1 hunks)
  • docs/source/installation.md (1 hunks)
  • docs/source/overview.md (1 hunks)
  • docs/source/prediction.md (1 hunks)
  • docs/source/profiling.md (1 hunks)
  • src/MEDS_tabular_automl/scripts/cache_task.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/describe_codes.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_xgboost.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/tabularize_static.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
Files skipped from review due to trivial changes (5)
  • .gitignore
  • docs/source/prediction.md
  • docs/source/profiling.md
  • src/MEDS_tabular_automl/scripts/cache_task.py
  • src/MEDS_tabular_automl/scripts/describe_codes.py
Additional context used
Markdownlint
docs/source/implementation.md

3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time

docs/source/installation.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

docs/source/overview.md

14-14: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time


33-33: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


41-41: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


53-53: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


68-68: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


79-79: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

README.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


167-167: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


169-169: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


68-68: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


76-76: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


88-88: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


103-103: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


114-114: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

LanguageTool
docs/source/index.md

[style] ~48-~48: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...ithin the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...


[style] ~48-~48: Using many exclamation marks might seem excessive (in this case: 8 exclamation marks for a text that’s 2711 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

docs/source/overview.md

[uncategorized] ~24-~24: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~37-~37: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...``` 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~37-~37: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~49-~49: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~64-~64: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~66-~66: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~77-~77: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~77-~77: You might be missing the article “the” here. (AI_EN_LECTOR_MISSING_DETERMINER_THE)
Context: ...izes` and `aggs` can be generated using `generate-permutations` command (See the ...


[uncategorized] ~90-~90: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... ``` 6. meds-tab-xgboost-sweep: Conducts an Optuna hyperparameter sweep...


[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~96-~96: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[uncategorized] ~96-~96: The preposition “on” seems more likely in this position than the preposition “in”. (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
Context: ...rectly call generate-permutations in the command line: ```bash genera...

README.md

[uncategorized] ~59-~59: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...``` 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~72-~72: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~84-~84: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~99-~99: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~101-~101: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~112-~112: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~125-~125: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... ``` 6. meds-tab-xgboost-sweep: Conducts an Optuna hyperparameter sweep...


[uncategorized] ~129-~129: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~131-~131: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

Ruff
src/MEDS_tabular_automl/scripts/tabularize_time_series.py

8-8: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


10-10: Module level import not at top of file (E402)


11-11: Module level import not at top of file (E402)


13-13: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)


15-15: Module level import not at top of file (E402)


16-16: Module level import not at top of file (E402)


18-18: Module level import not at top of file (E402)


19-19: Module level import not at top of file (E402)


20-20: Module level import not at top of file (E402)


21-21: Module level import not at top of file (E402)


22-22: Module level import not at top of file (E402)


23-30: Module level import not at top of file (E402)


94-94: Function definition does not bind loop variable agg (B023)


101-101: Function definition does not bind loop variable window_size (B023)


102-102: Function definition does not bind loop variable agg (B023)

src/MEDS_tabular_automl/scripts/tabularize_static.py

13-13: Module level import not at top of file (E402)


15-15: Module level import not at top of file (E402)


17-22: Module level import not at top of file (E402)


23-23: Module level import not at top of file (E402)


24-24: Module level import not at top of file (E402)


25-25: Module level import not at top of file (E402)


26-34: Module level import not at top of file (E402)


142-142: Function definition does not bind loop variable agg (B023)

docs/source/conf.py

61-61: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/launch_xgboost.py

90-90: Remove unnecessary True if ... else False (SIM210)


Additional comments not posted (15)
docs/requirements.txt (1)

4-4: Ensure the new dependencies sphinx_immaterial and esgpt are compatible with the project's existing environment.

Also applies to: 14-14

docs/make.bat (1)

13-14: Good addition of a conditional check to guide users when no arguments are provided.

src/MEDS_tabular_automl/scripts/tabularize_time_series.py (1)

18-23: Ensure the relative imports are correctly resolving the modules.

Tools
Ruff

18-18: Module level import not at top of file (E402)


19-19: Module level import not at top of file (E402)


20-20: Module level import not at top of file (E402)


21-21: Module level import not at top of file (E402)


22-22: Module level import not at top of file (E402)

src/MEDS_tabular_automl/scripts/tabularize_static.py (1)

17-26: Ensure the relative imports are correctly resolving the modules.

Verification successful

The relative imports in the src/MEDS_tabular_automl/scripts/tabularize_static.py file are consistent and correctly resolve the modules as they are used across multiple files in the project without any issues.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the resolution of relative imports after refactoring.

# Test: Search for the usage of these modules in the project to ensure no unresolved imports.
fd --exec rg 'from ..describe_codes import'
fd --exec rg 'from ..file_name import'
fd --exec rg 'from ..generate_static_features import'
fd --exec rg 'from ..mapper import'
fd --exec rg 'from ..utils import'

Length of output: 6598

Tools
Ruff

17-22: Module level import not at top of file (E402)


23-23: Module level import not at top of file (E402)


24-24: Module level import not at top of file (E402)


25-25: Module level import not at top of file (E402)

README.md (3)

49-57: The changes in the "Scripts and Examples" section enhance the clarity and usefulness of the documentation. Good job!


127-127: The "Additional CLI Scripts" section is well-written and provides useful information on the new CLI script.


152-152: The explanation of "tabular pipelines" is insightful and helps clarify common misconceptions about structured EHR data.

docs/source/conf.py (5)

15-18: The update to the project information, including the new version number and additional authors, is accurate and up-to-date.


85-96: The addition of new Sphinx extensions will enhance the documentation capabilities, providing more features such as todo items, code views, and coverage reports.


100-100: The change to the "pydata_sphinx_theme" is a good choice for enhancing the visual appeal and functionality of the documentation.


175-175: The simplification of the source suffix to only include ".rst" and ".md" makes the documentation sources more consistent.


135-135: Adding common build and system files to the exclude patterns is a good practice to keep the documentation build clean.

src/MEDS_tabular_automl/utils.py (1)

81-81: The addition of replace=True in the OmegaConf.register_new_resolver call is a good practice to ensure the resolver is correctly registered without conflicts.

src/MEDS_tabular_automl/scripts/launch_xgboost.py (2)

16-18: Relative imports improve modularity and maintainability of the codebase.


16-18: Ensure that the relative paths are correctly resolving to the intended modules.

Verification successful

The relative import paths in launch_xgboost.py are correctly resolving to the intended modules.

  • get_feature_columns is imported from describe_codes.
  • get_model_files and list_subdir_files are defined in file_name.py.
  • get_feature_indices and hydra_loguru_init are defined in utils.py.
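
To illustrate what "resolving" a `..` import means, here is a self-contained sketch that builds a throwaway package on disk and imports a module from its `scripts` sub-package. The package and function names (`demo_pkg`, `helper`, `main`) are hypothetical and only mirror the layout described above; they are not taken from the repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_relative_import():
    """Create demo_pkg/utils.py and demo_pkg/scripts/run.py, then import run."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        package = root / "demo_pkg"
        scripts = package / "scripts"
        scripts.mkdir(parents=True)
        (package / "__init__.py").write_text("")
        (scripts / "__init__.py").write_text("")
        (package / "utils.py").write_text(
            "def helper():\n    return 'from demo_pkg.utils'\n"
        )
        # `..utils` means: go up from demo_pkg.scripts to demo_pkg, then utils.
        (scripts / "run.py").write_text(
            "from ..utils import helper\n\ndef main():\n    return helper()\n"
        )
        sys.path.insert(0, str(root))
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_pkg.scripts.run")
            return module.main()
        finally:
            # Leave the interpreter state as it was before the demo.
            sys.path.remove(str(root))
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
                del sys.modules[name]


result = demo_relative_import()
print(result)
```

Relative imports only work when the module is loaded as part of its package (for example through an installed entry point or `python -m`); running such a file directly as a script raises an `ImportError`.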
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
ast-grep --lang python --pattern $'from ..describe_codes import get_feature_columns'

Length of output: 189



Script:

#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
ast-grep --lang python --pattern $'def get_model_files($_) { $$$ }' | grep 'file_name.py'
ast-grep --lang python --pattern $'def list_subdir_files($_) { $$$ }' | grep 'file_name.py'

# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
ast-grep --lang python --pattern $'def get_feature_indices($_) { $$$ }' | grep 'utils.py'
ast-grep --lang python --pattern $'def hydra_loguru_init($_) { $$$ }' | grep 'utils.py'

Length of output: 2179



Script:

#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
ast-grep --lang python --pattern 'def get_model_files' | grep 'file_name.py'
ast-grep --lang python --pattern 'def list_subdir_files' | grep 'file_name.py'

# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
ast-grep --lang python --pattern 'def get_feature_indices' | grep 'utils.py'
ast-grep --lang python --pattern 'def hydra_loguru_init' | grep 'utils.py'

Length of output: 1687



Script:

#!/bin/bash
# Description: Verify relative import paths resolve correctly.

# Verify the existence of `get_model_files` and `list_subdir_files` in `file_name.py`
rg 'def get_model_files' | grep 'file_name.py'
rg 'def list_subdir_files' | grep 'file_name.py'

# Verify the existence of `get_feature_indices` and `hydra_loguru_init` in `utils.py`
rg 'def get_feature_indices' | grep 'utils.py'
rg 'def hydra_loguru_init' | grep 'utils.py'

Length of output: 577


- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
coderabbitai bot (Contributor) commented:

Consider rephrasing for conciseness and clarity.

- This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
+ This approach simplifies the process and ensures high-quality, reproducible results for ...
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach simplifies the process and ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
Tools
LanguageTool

[style] ~48-~48: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...ithin the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...


[style] ~48-~48: Using many exclamation marks might seem excessive (in this case: 8 exclamation marks for a text that’s 2711 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

@@ -0,0 +1,14 @@
# How does MEDS-Tab Work?

#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
coderabbitai bot (Contributor) commented:

Adjust the heading level to increment by one from the previous level for better readability and structure.

-#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
+### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
### What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?
Tools
Markdownlint

3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time
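
The MD001 message reports that an h2 was expected after the h1 title. A minimal sketch of that rule follows; it is a simplified, hypothetical re-implementation for illustration, not markdownlint's code. Under this simplified rule both the original `####` heading and a `###` heading are flagged directly after a `#` title, while `##` passes.

```python
import re

FENCE = "`" * 3  # code-fence marker, built indirectly to keep this block simple


def heading_increment_violations(markdown_text):
    """Return (line_number, expected_level, actual_level) for each skipped level."""
    violations = []
    previous_level = 0
    in_fence = False
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside fenced code
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6})\s+\S", line)
        if not match:
            continue
        level = len(match.group(1))
        if previous_level and level > previous_level + 1:
            violations.append((line_number, previous_level + 1, level))
        previous_level = level
    return violations


title = "# How does MEDS-Tab Work?\n\n"
question = 'What do you mean "tabular pipelines"?\n'

with_h4 = heading_increment_violations(title + "#### " + question)
with_h3 = heading_increment_violations(title + "### " + question)
with_h2 = heading_increment_violations(title + "## " + question)

print(with_h4, with_h3, with_h2)
```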


**Local Install**

```
coderabbitai bot (Contributor) commented:

Specify the language for fenced code blocks to enhance readability and syntax highlighting.

-```
+```bash
Tools
Markdownlint

31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

README.md Outdated

See `tests/test_integration.py` for an example of the end-to-end pipeline being run on synthetic data. This
script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

For an end to end example over MIMIC-IV, see the [companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV)
For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).

### Core CLI Scripts Overview
## Core CLI Scripts Overview

1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types
coderabbitai bot (Contributor) commented:

Please review the punctuation and grammar in the descriptions of the CLI scripts to ensure clarity and professionalism.

- **`meds-tab-describe`**: This command processes MEDS data shards...
+ **`meds-tab-describe`**: This command processes MEDS data shards...
- **`meds-tab-tabularize-static`**: Filters and processes the dataset based...
+ **`meds-tab-tabularize-static`**: Filters and processes the dataset based...
- **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard...
+ **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard...
- **`meds-tab-xgboost`**: Trains an XGBoost model using user-specific parameters.
+ **`meds-tab-xgboost`**: Trains an XGBoost model using user-specific parameters.
- **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep...
+ **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep...
- **`generate-permutations`**: Generates and prints a sorted list of all permutations...
+ **`generate-permutations`**: Generates and prints a sorted list of all permutations...

Also applies to: 72-72, 84-84, 99-99, 112-112, 125-125, 129-129

Tools
LanguageTool

[uncategorized] ~59-~59: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


### Additional CLI Scripts

1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.

Address markdown linting issues by specifying a language for fenced code blocks.

- ```bash
+ ```bash
Tools
LanguageTool

[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...
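
For readers following this thread, the behavior described for `generate-permutations` can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: it assumes that "permutations" here means every non-empty subset of the input (which matches the stated goal of sweeping over combinations of window sizes and aggregations), and the function name `generate_subsets` is hypothetical.

```python
from itertools import combinations


def generate_subsets(csv_items: str) -> list:
    """Return every non-empty subset of a comma-separated list, in sorted order.

    Assumption: the input may optionally be wrapped in square brackets,
    e.g. "[1d,30d,365d]" or "1d,30d,365d".
    """
    # Split on commas, drop surrounding brackets and whitespace, sort the items.
    items = sorted(
        item.strip() for item in csv_items.strip("[]").split(",") if item.strip()
    )
    # Enumerate subsets of every size from 1 up to the full set.
    subsets = [
        list(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
    # Sort the subsets themselves so the output order is deterministic.
    return sorted(subsets)


window_subsets = generate_subsets("[1d,30d,365d]")
for subset in window_subsets:
    print(subset)
```

Three window sizes yield seven candidate subsets, so the sweep space grows as 2^n - 1 with the number of options.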


4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

**Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)

Correct the grammar to improve readability.

- Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

Committable suggestion was skipped due to low confidence.

Tools
LanguageTool

[grammar] ~66-~66: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...
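
The alignment step described for `meds-tab-cache-task` (matching each label to the nearest prior event) can be illustrated with a small standard-library sketch. Everything below is an assumption made for illustration: the function name `align_labels`, the choice to treat an event at exactly the label time as "prior", and the choice to drop labels that have no earlier event. The project itself may handle these cases differently.

```python
from bisect import bisect_right
from datetime import datetime


def align_labels(event_times, label_rows):
    """Match each label to the latest event at or before the label timestamp.

    event_times: dict mapping patient_id -> sorted list of event datetimes.
    label_rows: iterable of (patient_id, timestamp, label) tuples.
    Returns a list of (patient_id, matched_event_time, label) tuples.
    Labels with no event at or before their timestamp are dropped (assumption).
    """
    aligned = []
    for patient_id, timestamp, label in label_rows:
        times = event_times.get(patient_id, [])
        # bisect_right counts the events whose time is <= timestamp.
        idx = bisect_right(times, timestamp)
        if idx == 0:
            continue  # no prior event for this label
        aligned.append((patient_id, times[idx - 1], label))
    return aligned


events = {1: [datetime(2020, 1, 1), datetime(2020, 1, 5), datetime(2020, 2, 1)]}
labels = [
    (1, datetime(2020, 1, 10), 1),   # matches the 2020-01-05 event
    (1, datetime(2019, 12, 31), 0),  # earlier than every event: dropped
    (2, datetime(2020, 1, 1), 1),    # patient with no events: dropped
]
aligned = align_labels(events, labels)
print(aligned)
```

On a dataframe, the same backward match is usually expressed as an as-of join rather than a per-row binary search.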

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2f72bb8 and b0365a6.

Files selected for processing (1)
  • docs/requirements.txt (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • docs/requirements.txt

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b0365a6 and b2655a6.

Files selected for processing (1)
  • .readthedocs.yaml (1 hunks)
Files skipped from review due to trivial changes (1)
  • .readthedocs.yaml

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Outside diff range and nitpick comments (1)
README.md (1)

Line range hint 137-137: Add a comma after "for example" for grammatical correctness.

- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:
Tools
LanguageTool

[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

Markdownlint

47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b2655a6 and 85bfd5e.

Files selected for processing (3)
  • .readthedocs.yaml (1 hunks)
  • README.md (6 hunks)
  • docs/source/index.md (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • .readthedocs.yaml
Additional context used
LanguageTool
docs/source/index.md

[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...


[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

README.md

[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~107-~107: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~135-~135: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...


[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...


[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...


[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...


[style] ~294-~294: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...


[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...


[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...


[style] ~334-~334: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...


[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...


[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...


[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

Markdownlint
README.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


185-185: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


309-309: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


400-400: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


520-520: null (MD038, no-space-in-code)
Spaces inside code span elements


31-31: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


215-215: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


276-276: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


414-414: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


484-484: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (4)
docs/source/index.md (3)

3-3: Ensure the introduction is engaging and clearly states the purpose of MEDS-Tab.


5-15: The table of contents is well-structured and provides a clear path for navigating the documentation.


19-21: This section effectively highlights the benefits of using MEDS-Tab.

README.md (1)

43-45: Ensure consistency in documentation.

The installation instructions are duplicated here, which might confuse users. Consider removing this section or integrating it with the primary installation instructions at the beginning of the document.

Comment on lines 25 to 27
- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.


The bullet points here succinctly communicate the advantages of MEDS-Tab. Consider adding specific metrics or examples to further bolster these claims.

- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- - Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.

README.md Outdated
Comment on lines 40 to 41
1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.

Rephrase the description for clarity and conciseness.

- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
Committable suggestion

Suggested change
1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.
1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.

README.md Outdated
@@ -98,9 +106,9 @@

4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

**Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
**Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)

Correct grammatical error in the description.

- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+ **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).
Committable suggestion

Suggested change
- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+ **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).
Tools
LanguageTool

[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

README.md Outdated
Comment on lines 78 to 80
2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
**Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

Clarify and simplify the example description.

- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d,  365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
Committable suggestion

Suggested change
2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.
**Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
**Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d, 365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.
**Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
Tools
LanguageTool

[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 85bfd5e and f6e229a.

Files selected for processing (4)
  • docs/source/implementation.md (1 hunks)
  • docs/source/overview.md (1 hunks)
  • docs/source/prediction.md (1 hunks)
  • docs/source/profiling.md (1 hunks)
Additional context used
LanguageTool
docs/source/implementation.md

[typographical] ~3-~3: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...


[uncategorized] ~9-~9: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...


[style] ~66-~66: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...


[uncategorized] ~74-~74: A determiner appears to be missing. Consider inserting it. (AI_EN_LECTOR_MISSING_DETERMINER)
Context: ...ow sizes. 5. Output Storage: - Sparse array is converted to Coordinate List f...


[style] ~95-~95: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

docs/source/profiling.md

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...


[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...


[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...


[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...


[uncategorized] ~26-~26: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...


[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~85-~85: You might be missing the article “the” here. (AI_EN_LECTOR_MISSING_DETERMINER_THE)
Context: ...izesandaggscan be generated usinggenerate-permutations` command (See the ...


[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[uncategorized] ~102-~102: The preposition “on” seems more likely in this position than the preposition “in”. (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
Context: ...rectly call generate-permutations in the command line: ```bash genera...


[uncategorized] ~125-~125: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...


[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

Markdownlint
docs/source/implementation.md

16-16: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


77-77: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

docs/source/profiling.md

90-90: null (MD047, single-trailing-newline)
Files should end with a single newline character

docs/source/overview.md

150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/prediction.md

121-121: null (MD038, no-space-in-code)
Spaces inside code span elements


15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


85-85: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:

```

Specify language for fenced code blocks to adhere to Markdown best practices.

- ```
+ ```plaintext

Also applies to: 77-77

Tools
Markdownlint

16-16: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


1. Describe codes (compute feature frequencies)
2. Given time series data tabularize it
3. cache task specific rows of data for efficient loading

Hyphenate "task-specific" for grammatical correctness.

- 3. cache task specific rows of data for efficient loading
+ 3. cache task-specific rows of data for efficient loading
Committable suggestion

Suggested change
- 3. cache task specific rows of data for efficient loading
+ 3. cache task-specific rows of data for efficient loading
Tools
LanguageTool

[uncategorized] ~9-~9: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...


5. **Output Storage**:

- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.

Insert "a" before "Sparse array" to correct the determiner omission.

- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
Committable suggestion

Suggested change
- - Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ - A Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
Tools
LanguageTool

[uncategorized] ~74-~74: A determiner appears to be missing. Consider inserting it. (AI_EN_LECTOR_MISSING_DETERMINER)
Context: ...ow sizes. 5. Output Storage: - Sparse array is converted to Coordinate List f...
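
As background for the "Output Storage" line under review: Coordinate List (COO) format keeps only the non-zero entries of a matrix as (row, column, value) triplets plus the overall shape. The plain-Python sketch below shows the round trip under that definition. The helper names `dense_to_coo` and `coo_to_dense` are hypothetical, and nothing here reflects how MEDS-Tab actually writes its files. In practice a sparse-matrix library would do this work, for example SciPy's `scipy.sparse.save_npz` for the `.npz` write, though whether the project uses it is not stated in this thread.

```python
def dense_to_coo(matrix):
    """Convert a dense list-of-lists matrix to COO triplets plus its shape."""
    rows, cols, values = [], [], []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:  # only non-zero entries are stored
                rows.append(r)
                cols.append(c)
                values.append(value)
    shape = (len(matrix), len(matrix[0]) if matrix else 0)
    return rows, cols, values, shape


def coo_to_dense(rows, cols, values, shape):
    """Rebuild the dense matrix from COO triplets."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(rows, cols, values):
        dense[r][c] = v
    return dense


# A mostly-empty feature matrix: 2 rows (patient-timestamps) by 3 columns (features).
features = [[0, 0, 3], [4, 0, 0]]
rows, cols, values, shape = dense_to_coo(features)
print(rows, cols, values, shape)
```

Storage cost scales with the number of non-zero entries rather than with rows times columns, which is why the format suits wide, sparse feature tables.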


## MEDS-Tab Tabularization Technique

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.

Consider adding a comma after "method" for better readability.

- ...more memory efficient version of their method which we denote `catabra-mem`.
+ ...more memory efficient version of their method, which we denote `catabra-mem`.
Committable suggestion

Suggested change
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.
Tools
LanguageTool

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...


[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

| --------- | ----------- | ----------- | ----------- | -------- |
| 0m15.867s | 1,410.79 MB | 3,539.32 MB | 442 MB | meds_tab |

______________________________________________________________________

Ensure the file ends with a single newline character.

+ \n
Committable suggestion

Suggested change
______________________________________________________________________
______________________________________________________________________
Tools
Markdownlint

90-90: null (MD047, single-trailing-newline)
Files should end with a single newline character
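
As an illustration of what MD047 asks for, the following sketch normalizes a string so that it ends with exactly one newline. The function name `normalize_trailing_newline` is hypothetical and is not part of markdownlint or of this repository; it only demonstrates the rule.

```python
def normalize_trailing_newline(text: str) -> str:
    """Return text ending in exactly one newline (empty input stays empty)."""
    if not text:
        return text
    # Strip every trailing newline, then add a single one back.
    return text.rstrip("\n") + "\n"


# A file body with no trailing newline and one with several.
fixed_missing = normalize_trailing_newline("| 0m15.867s | meds_tab |")
fixed_extra = normalize_trailing_newline("| 0m15.867s | meds_tab |\n\n\n")
```

Both calls yield the same string, so applying the function repeatedly changes nothing further.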

3. We need to ensure full and seamless compatibility with the ACES CLI tool, rather than relying on the python API and manual adjustments:
[#34](https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/34)

# How does MEDS-Tab Work?

Avoid multiple top-level headings in the same document to adhere to Markdown best practices.

- # How does MEDS-Tab Work?
+ ## How does MEDS-Tab Work?
Committable suggestion

Suggested change
# How does MEDS-Tab Work?
## How does MEDS-Tab Work?
Tools
Markdownlint

150-150: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


A single XGBoost run was completed to profile time and memory usage. This was done for each `$TASK` using the following command:

```

Specify language for fenced code blocks to adhere to Markdown best practices.

- ```
+ ```bash

Also applies to: 85-85

Tools
Markdownlint

15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

| LOS in Hospital > 3 days | Admission + 24 hr | 6m4.884s | 7m5.025s | 1m4.335s | 11011.710 | 12223.449 |
| LOS in Hospital > 3 days | Admission + 48 hr | 6m9.587s | 7m12.853s | 1m3.858s | 10703.064 | 11829.742 |

#### 1.1.2 MIMIC-IV Task Specific Training Cohort Size

Hyphenate "Task-Specific" for grammatical correctness.

- #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size
Committable suggestion

Suggested change
#### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
#### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size
Tools
LanguageTool

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f6e229a and 0227900.

Files selected for processing (1)
  • docs/source/prediction.md (1 hunks)
Files not reviewed due to errors (1)
  • docs/source/prediction.md (no review received)
Additional context used
LanguageTool
docs/source/prediction.md

[uncategorized] ~58-~58: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...


[uncategorized] ~186-~186: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 2.1.3 eICU Task Specific Training Cohort Size | Task ...

Markdownlint
docs/source/prediction.md

2-2: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


4-4: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


42-42: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


110-110: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


113-113: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


132-132: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


152-152: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


112-112: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


113-113: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


114-114: Expected: 1; Actual: 4 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


134-134: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


185-185: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


198-198: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


199-199: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


76-76: null (MD019, no-multiple-space-atx)
Multiple spaces after hash on atx style heading


43-43: Expected: 1; Actual: 0; Above (MD022, blanks-around-headings)
Headings should be surrounded by blank lines


43-43: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines


135-135: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines


152-152: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines


153-153: Expected: 1; Actual: 0; Above (MD022, blanks-around-headings)
Headings should be surrounded by blank lines


186-186: Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
Headings should be surrounded by blank lines


15-15: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


21-21: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


23-23: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


41-41: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


79-79: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


87-87: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


89-89: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


96-96: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


98-98: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


109-109: null (MD031, blanks-around-fences)
Fenced code blocks should be surrounded by blank lines


132-132: null (MD032, blanks-around-lists)
Lists should be surrounded by blank lines


110-110: null (MD038, no-space-in-code)
Spaces inside code span elements


15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


79-79: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0227900 and 1ab0405.

Files selected for processing (4)
  • docs/source/index.md (1 hunks)
  • docs/source/overview.md (1 hunks)
  • docs/source/prediction.md (1 hunks)
  • docs/source/profiling.md (1 hunks)
Files not reviewed due to errors (1)
  • docs/source/prediction.md (no review received)
Additional context used
LanguageTool
docs/source/index.md

[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...


[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!

docs/source/profiling.md

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d that on the MIMIC-IV and eICU medical datasets we significantly outperform past method...


[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...


[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...


[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[uncategorized] ~125-~125: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...


[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...


[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 2.1.3 eICU Task Specific Training Cohort Size | Task ...

Markdownlint
docs/source/overview.md

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/prediction.md

121-121: null (MD038, no-space-in-code)
Spaces inside code span elements


15-15: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


85-85: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (2)
docs/source/index.md (1)

25-27: The bullet points here succinctly communicate the advantages of MEDS-Tab. Consider adding specific metrics or examples to further bolster these claims.

- - Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.
+ - Strong Performance: MEDS-Tab consistently outperforms other frameworks in benchmarks such as [specific benchmark], demonstrating its robustness and high performance.

Likely invalid or redundant comment.

docs/source/overview.md (1)

74-74: Correct the grammar to improve readability.

- Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.

Likely invalid or redundant comment.

Tools
LanguageTool

[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

Comment on lines 3 to 34
MEDS-Tab is a library designed for automated tabularization, data preparation with aggregation, and windowing. Check out below for an overview of MEDS-Tab and how it could be useful in your workflows!

```{toctree}
---
glob:
maxdepth: 2
---
Overview <overview>
Pipeline/Architecture Details <implementation>
Computational Profile <profiling>
Prediction Performance <prediction>
Module API Reference <api/modules>
```

______________________________________________________________________

## Why MEDS-Tab?

MEDS-Tab is a comprehensive framework designed to streamline the handling, modeling, and analysis of complex medical time-series data. By leveraging automated processes, MEDS-Tab significantly reduces the computation required to generate high-quality baseline models for diverse supervised learning tasks.

### I. Transform to MEDS

- Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- Strong Performance: MEDS-Tab produces robust and high-performance across various datasets compared with other frameworks.

### II. Run MEDS-Tab

- Run the MEDS-Tab Command-Line Interface tool (`MEDS-Tab-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://meds-tab--36.org.readthedocs.build/en/36/overview.html#core-cli-scripts-overview)!

- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!

Consider rephrasing for conciseness and clarity.

- This approach not only simplifies the process but also ensures high-quality, reproducible results for ...
+ This approach simplifies the process and ensures high-quality, reproducible results for ...
Committable suggestion

Suggested change
By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the MEDS-Tab ecosystem. This approach simplifies the process and ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
Tools
LanguageTool

[style] ~34-~34: Opting for a less wordy alternative here can improve the clarity of your writing. (NOT_ONLY_ALSO)
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...


[style] ~34-~34: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 1704 characters long) (EN_EXCESSIVE_EXCLAMATION)
Context: ... datasets in reasonable raw formulations!


## 2. Comparative Performance Analysis

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.

Drop the intensifier "very" and add the missing article "a" before "small number".

- ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly.
+ ...use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients), and should be processed quickly.
Committable suggestion

Suggested change
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
Tools
LanguageTool

[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...


[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...


[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

Comment on lines 22 to 26
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMIC-IV and eICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.

## 2. Comparative Performance Analysis

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.

Remove "of" after "all" for conciseness.

- ...ing the better performance of MEDS-Tab in all of the scenarios.
+ ...ing the better performance of MEDS-Tab in all the scenarios.
Committable suggestion

Suggested change
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
Tools
LanguageTool

[uncategorized] ~22-~22: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d that on the MIMIC-IV and eICU medical datasets we significantly outperform past method...


[style] ~26-~26: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...


[style] ~26-~26: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...


[uncategorized] ~26-~26: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 215ab04 and 88320ea.

Files selected for processing (2)
  • README.md (3 hunks)
  • docs/source/prediction.md (1 hunks)
Files not reviewed due to errors (1)
  • docs/source/prediction.md (no review received)
Additional context used
LanguageTool
docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... 11829.742 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...


[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

README.md

[uncategorized] ~69-~69: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~78-~78: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~92-~92: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~107-~107: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~135-~135: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[uncategorized] ~160-~160: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...


[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...


[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...


[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...


[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...


[style] ~294-~294: Consider using a different verb to strengthen your wording. (SPEED_UP_ACCELERATE)
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...


[uncategorized] ~330-~330: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...


[style] ~334-~334: Consider removing “of” to be more concise (ALL_OF_THE)
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...


[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...


[uncategorized] ~334-~334: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...


[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen. (MISSING_HYPHEN)
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...


[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

Markdownlint
docs/source/prediction.md

121-121: null (MD038, no-space-in-code)
Spaces inside code span elements

README.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


36-36: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


185-185: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


309-309: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


400-400: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


47-47: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


53-53: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


520-520: null (MD038, no-space-in-code)
Spaces inside code span elements


215-215: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


276-276: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


414-414: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


484-484: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (9)
README.md (9)

25-25: Installation command is clear and concise.

This line provides a straightforward command for installing the package via pip, which is crucial for user accessibility.


31-31: Local installation instructions are clear.

The instructions for cloning the repository and installing the package locally are clear, promoting easy setup for developers.


76-76: Explanation of caching feature names and frequencies is clear.

The documentation clearly explains how the meds-tab-describe script computes and caches feature names and frequencies, which is essential for understanding the data processing workflow.


40-42: Clarify the description of tabular feature construction and usage.

The description of constructing and using tabular features can be enhanced for better clarity and precision.

- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.
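
To make the windowed-feature idea in this description concrete, here is a minimal pure-Python sketch of counting codes over windows that look backward from an index date. It is an illustration only and not the MEDS-Tab implementation: the toy events, the `code_counts` helper, and the `code/window/count` feature naming are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical toy events: (patient_id, timestamp, code).
EVENTS = [
    (1, datetime(2023, 1, 10), "LAB//A"),
    (1, datetime(2023, 6, 1), "LAB//A"),
    (1, datetime(2023, 12, 20), "DX//B"),
    (2, datetime(2023, 12, 1), "LAB//A"),
]

# Window widths measured backward from the index date; None means "full history".
WINDOWS = {"30d": timedelta(days=30), "365d": timedelta(days=365), "full": None}


def code_counts(events, patient_id, index_date, windows=WINDOWS):
    """Count each code per window for one patient, looking back from index_date."""
    features = Counter()
    for pid, ts, code in events:
        # Skip other patients and anything after the index date.
        if pid != patient_id or ts > index_date:
            continue
        for name, width in windows.items():
            if width is None or index_date - ts <= width:
                # Feature naming here is an assumption made for illustration.
                features[f"{code}/{name}/count"] += 1
    return dict(features)


features = code_counts(EVENTS, 1, datetime(2024, 1, 1))
```

Each returned key is one flat (non-longitudinal) column for the given patient and index date, which is the shape a tabular model such as XGBoost consumes.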

80-80: Example command for static data tabularization needs clarification.

The example command provided for static data tabularization is detailed but could be simplified for better readability.

- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d,  365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

109-109: Grammar correction needed in example description.

The description has a grammatical error that needs correction for clarity.

- **Example: Aligh tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+ **Example: Align Tabularized Data**: For a specific task `$TASK`, align labels that have been pulled from [ACES](https://github.com/justin13601/ACES).
Tools
LanguageTool

[grammar] ~109-~109: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


133-133: Clarify the functionality of the generate-permutations command.

The description of the generate-permutations command can be improved for better understanding.

- 1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.
+ 1. **`generate-permutations`**: Generates and prints all permutations from a comma-separated input, aiding in sweeping over combinations of window sizes and aggregations.
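
For readers unsure what "sweeping over combinations" means here, the sketch below enumerates every non-empty subset of a comma-separated list. This is one interpretation of the description, not the actual `generate-permutations` script: the `generate_subsets` helper and the bracketed output format are assumptions, and the real command's output may differ.

```python
from itertools import combinations


def generate_subsets(csv_items):
    """Return a sorted list of all non-empty subsets of a comma-separated input."""
    # Accept input with or without surrounding brackets, e.g. "[1d,30d]" or "1d,30d".
    items = [s.strip() for s in csv_items.strip("[]").split(",") if s.strip()]
    subsets = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            subsets.append("[" + ",".join(combo) + "]")
    return sorted(subsets)


result = generate_subsets("[1d,30d,full]")
print(",".join(result))
```

With three window sizes this yields seven candidate lists, which could then be passed one at a time to a hyperparameter sweep.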

170-170: Simplify language for clarity.

The language used in the description can be simplified for better readability.

- We should likely decorrelate the default aggregations and/or window sizes we use prior to passing them into the models as features.
+ We should likely decorrelate the default aggregations and window sizes before using them as features in the models.
Tools
LanguageTool

[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...


202-202: Insert a comma for correct punctuation.

A comma is missing in the sentence, which is necessary for correct punctuation and readability.

- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
Tools
LanguageTool

[typographical] ~202-~202: It appears that a comma is missing. (DURING_THAT_TIME_COMMA)
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 88320ea and 4a57e11.

Files selected for processing (2)
  • docs/source/installation.md (1 hunks)
  • docs/source/overview.md (1 hunks)
Additional context used
Markdownlint
docs/source/installation.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/overview.md

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

LanguageTool
docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

Additional comments not posted (1)
docs/source/overview.md (1)

102-102: Add a comma after introductory phrase for clarity.

The phrase "For example" should be followed by a comma to improve readability.

- For example you can directly call **`generate-permutations`**
+ For example, you can directly call **`generate-permutations`**

Likely invalid or redundant comment.

Tools
LanguageTool

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

Clarify conjunction usage in sentence.

The use of "thus" in the description of meds-tab-tabularize-static might confuse readers as it suggests a cause-effect relationship not clearly stated.

- thus rows are duplicated across multiple timestamps for the same patient.
+ Consequently, rows are duplicated across multiple timestamps for the same patient.

Suggested change
- 2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.
+ 2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination. Consequently, rows are duplicated across multiple timestamps for the same patient.
Tools
LanguageTool

[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4a57e11 and ba90809.

Files selected for processing (2)
  • docs/source/installation.md (1 hunks)
  • docs/source/overview.md (1 hunks)
Additional context used
Markdownlint
docs/source/installation.md

19-19: null (MD025, single-title, single-h1)
Multiple top-level headings in the same document


23-23: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


29-29: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

docs/source/overview.md

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

LanguageTool
docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...


[uncategorized] ~43-~43: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence. (THUS_SENTENCE)
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...


[uncategorized] ~57-~57: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...


[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...


[grammar] ~74-~74: Possible subject-verb agreement error detected. (PLURAL_THAT_AGREEMENT)
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...


[uncategorized] ~85-~85: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...


[uncategorized] ~100-~100: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...


[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used. (COMMA_FOR_EXAMPLE)
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...


[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

Comment on lines 1 to 13
# Scalable tabularization and tabular feature usage utilities over generic MEDS datasets

<p align="center">
<a href="https://www.python.org/downloads/release/python-3100/"><img alt="Python" src="https://img.shields.io/badge/-Python_3.12+-blue?logo=python&logoColor=white"></a>
<a href="https://pypi.org/project/meds-tab/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.2.5-orange?logoColor=orange"></a>
<a href="https://hydra.cc/"><img alt="Hydra" src="https://img.shields.io/badge/Config-Hydra_1.3-89b8cd"></a>
<a href="https://codecov.io/gh/mmcdermott/MEDS_Tabular_AutoML"><img alt="Codecov" src="https://codecov.io/gh/mmcdermott/MEDS_Tabular_AutoML/graph/badge.svg?token=6GD05EDQ39"></a>
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/tests.yaml"><img alt="Tests" src="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/tests.yaml/badge.svg"></a>
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/code-quality-main.yaml"><img alt="Code Quality" src="https://github.com/mmcdermott/MEDS_Tabular_AutoML/actions/workflows/code-quality-main.yaml/badge.svg"></a>
<a href='https://meds-tab.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/meds-tab/badge/?version=latest' alt='Documentation Status' /></a>
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/graphs/contributors"><img alt="Contributors" src="https://img.shields.io/github/contributors/mmcdermott/MEDS_Tabular_AutoML.svg"></a>
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML/pulls"><img alt="Pull Requests" src="https://img.shields.io/badge/PRs-welcome-brightgreen.svg"></a>
<a href="https://github.com/mmcdermott/MEDS_Tabular_AutoML#license"><img alt="License" src="https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray"></a>

Improve badge alignment and presentation.

Consider using a table or a more structured layout for the badges to enhance readability and visual appeal.

Comment on lines +1 to +2
# Usage

Clarify the repository's purpose in the introduction.

The introduction could be expanded to provide more details about the specific capabilities and advantages of the repository.

Comment on lines +8 to +30
## Quick Start

To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

```console
pip install meds-tab
```

**Local Install**

```console
# clone the git repo
pip install .
```

## Scripts and Examples

For an end to end example over MIMIC-IV, see the [MIMIC-IV companion repository](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV).
For an end to end example over Philips eICU, see the [eICU companion repository](https://github.com/mmcdermott/MEDS_TAB_EICU).

See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

Enhance the Quick Start and Scripts sections with more detailed examples.

Providing more detailed examples and explanations can help new users better understand how to use the repository effectively.

Tools
Markdownlint

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


18-18: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading


To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

Convert emphasized text to headings for clarity and consistency.

Using headings instead of bold text for section titles can improve the structure and readability of the document.

-**Pip Install**
+### Pip Install

-**Local Install**
+### Local Install

Also applies to: 18-18

Tools
Markdownlint

12-12: null (MD036, no-emphasis-as-heading)
Emphasis used instead of a heading

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 18

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ba90809 and c6cc059.

Files selected for processing (5)
  • README.md (3 hunks)
  • docs/source/implementation.md (1 hunks)
  • docs/source/installation.md (1 hunks)
  • docs/source/profiling.md (1 hunks)
  • pyproject.toml (1 hunks)
Files skipped from review due to trivial changes (1)
  • pyproject.toml
Additional context used
Markdownlint
docs/source/installation.md

19-19: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


23-23: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


29-29: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

README.md

23-23: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


29-29: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


185-185: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


47-47: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


53-53: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


309-309: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


400-400: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


520-520: null
Spaces inside code span elements

(MD038, no-space-in-code)


215-215: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


276-276: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


414-414: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


484-484: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

LanguageTool
docs/source/implementation.md

[uncategorized] ~14-~14: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~64-~64: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

(A_LOT_OF)


[style] ~93-~93: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

docs/source/profiling.md

[uncategorized] ~20-~20: Possible missing comma found.
Context: ...w that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

README.md

[uncategorized] ~69-~69: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~78-~78: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~78-~78: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)


[uncategorized] ~92-~92: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~107-~107: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)


[grammar] ~109-~109: Possible subject-verb agreement error detected.
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

(PLURAL_THAT_AGREEMENT)


[uncategorized] ~120-~120: Loose punctuation mark.
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~135-~135: Loose punctuation mark.
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~137-~137: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)


[uncategorized] ~160-~160: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~170-~170: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

(EN_WORDINESS_PREMIUM_PRIOR_TO)


[typographical] ~202-~202: It appears that a comma is missing.
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

(DURING_THAT_TIME_COMMA)


[uncategorized] ~208-~208: When ‘task-specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ...time series data tabularize it 3. cache task specific rows of data for efficient loading 4. X...

(SPECIFIC_HYPHEN)


[uncategorized] ~213-~213: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~265-~265: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

(A_LOT_OF)


[style] ~294-~294: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)


[uncategorized] ~330-~330: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~330-~330: Possible missing comma found.
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~334-~334: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[style] ~334-~334: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 pa...

(EN_WEAK_ADJECTIVE)


[uncategorized] ~334-~334: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ## eICU Dataset _____________...

(MISSING_HYPHEN)


[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ...1829.742 | #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)

Additional comments not posted (10)
docs/source/installation.md (2)

3-13: Improve badge alignment and presentation.

Consider using a table or a more structured layout for the badges to enhance readability and visual appeal.


16-17: LGTM!

The description of the repository is clear and concise.

docs/source/implementation.md (4)

5-10: LGTM!

The description of the method's discrete parts is clear and concise.


36-37: LGTM!

The description of the data loading and sharding process is clear and concise.


70-72: LGTM!

The description of the output storage process is clear and concise.


100-106: LGTM!

The description of the XGBoost training process is clear and concise.

docs/source/profiling.md (3)

1-3: LGTM!

The heading and introductory paragraph are correctly formatted.


5-11: LGTM!

The description of the system comparison overview is clear and concise.


12-17: LGTM!

The description of the benchmarking tests is clear and concise.

README.md (1)

40-41: Rephrase for clarity and conciseness.

Improve the description for better readability.

- 1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) backwards in time from a given index date.
+ 1. Efficient construction and loading of flat, non-longitudinal summary features for patient records in MEDS, covering arbitrary time-windows (e.g., 1 year, 6 months) backward from a specified index date.

Likely invalid or redundant comment.

README.md Outdated
@@ -22,13 +22,13 @@ To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

- ```bash
+ ```console

Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Pip Install**
+ ### Pip Install
- **Local Install**
+ ### Local Install

Also applies to: 31-31

### Scripts and Examples
To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Pip Install**
+ ### Pip Install
- **Local Install**
+ ### Local Install

Also applies to: 53-53

Tools
Markdownlint

47-47: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)
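
A rough sketch of how lines flagged by this rule could be converted in bulk. It only approximates MD036 (the real rule has additional conditions), and `emphasis_to_heading` is a hypothetical helper, not part of markdownlint or MEDS-Tab:

```python
import re

# Approximation of MD036: a line made up solely of bold or italic text.
EMPHASIS_ONLY = re.compile(r"^\s*(\*\*|__|\*|_)(?P<title>[^*_]+?)\1\s*$")


def emphasis_to_heading(line: str, level: int = 3) -> str:
    """Turn an emphasis-only line into a heading; leave other lines untouched."""
    match = EMPHASIS_ONLY.match(line)
    if match is None:
        return line
    return f"{'#' * level} {match.group('title').strip()}"
```

Lines that merely contain emphasis inside a sentence do not match the pattern, so ordinary prose is returned unchanged.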


1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types
1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:

Fix grammatical issues and improve clarity.

Address loose punctuation marks and improve descriptions for better readability.

- 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
+ 1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:
- **Example: Tabularizing static data** with the minimum code frequency of 10 and window sizes of `[1d, 30d,  365d, full]` and value aggregation methods of `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
+ **Example: Tabularizing Static Data**: Minimum code frequency: 10, Window sizes: `[1d, 30d, 365d, full]`, Aggregation methods: `[static/present, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`
- 3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`.
+ 3. **`meds-tab-tabularize-time-series`**: Aggregates patient data for each unique `patient_id` x `timestamp` using combinations of `window_size` and `aggregation`.
- 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+ 4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
- 5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using `generate-permutations` command (See the section below for descriptions).
+ 5. **`meds-tab-xgboost`**: Trains an XGBoost model using user-specified parameters. Permutations of `window_sizes` and `aggs` can be generated using the `generate-permutations` command (see below for descriptions).
- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:

Also applies to: 78-80, 92-92, 107-107, 120-120, 135-137

Tools
LanguageTool

[uncategorized] ~69-~69: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)

README.md Outdated
## Implementation Improvements
## The MEDS-Tab Architecture

In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.

Fix grammatical issues and improve clarity.

Address missing commas, hyphenation, and wordiness for better readability.

- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization and XGBoost tuning processes.
- Given time series data tabularize it 3. cache task specific rows of data for efficient loading 4. XGBoost training
+ Given time series data tabularize it 3. cache task-specific rows of data for efficient loading 4. XGBoost training
- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.
+ Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly accelerates the rolling window aggregations on datasets with many concurrent observations.
- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.

Also applies to: 208-208, 213-213, 265-265, 294-294

Tools
LanguageTool

[typographical] ~202-~202: It appears that a comma is missing.
Context: ... ## The MEDS-Tab Architecture In this section we describe the MEDS-Tab architecture, ...

(DURING_THAT_TIME_COMMA)

README.md Outdated

## MEDS-Tab Tabularization Technique

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. We find that on the MIMICIV and EICU medical datasets we significantly outperform past methods. `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for EICU, while our method can scale to process hundreds of patients with low memory usage. We present the results below.

Fix grammatical issues and improve clarity.

Address missing commas, hyphenation, and wordiness for better readability.

- Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`.
- The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are very small number of patients (10, 100, and 500 patients), and should be processed quickly.
+ The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. We additionally use a budget of 10 minutes as these are a small number of patients (10, 100, and 500 patients) and should be processed quickly.
- #### 1.1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.1.2 MIMIC-IV Task-Specific Training Cohort Size

Also applies to: 334-334, 462-462

Tools
LanguageTool

[uncategorized] ~330-~330: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~330-~330: Possible missing comma found.
Context: ...nd that on the MIMICIV and EICU medical datasets we significantly outperform past method...

(AI_HYDRA_LEO_MISSING_COMMA)


To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

Convert emphasized text to headings for clarity and consistency.

Using headings instead of bold text for section titles improves the structure and readability of the document.

-**Pip Install**
+### Pip Install
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Pip Install**
### Pip Install
Tools
Markdownlint

23-23: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Comment on lines +1 to +3
# The MEDS-Tab Architecture

In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.

Consider adding a comma for better readability.

The introductory sentence could benefit from a comma after "In this section" to improve the flow and readability.

- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
Committable suggestion

Suggested change
# The MEDS-Tab Architecture
In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
# The MEDS-Tab Architecture
In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.


## 1. Describe Codes (compute feature frequencies)

This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:

Consider adding a comma after 'as follows'.

This would clarify the separation between the introductory clause and the detailed description that follows.

- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
Committable suggestion

Suggested change
This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:
Tools
LanguageTool

[uncategorized] ~14-~14: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)


**Detailed Workflow:**

- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.

Consider using a stronger verb choice.

Replacing "speeds up" with "accelerates" might provide a stronger and more formal expression in the documentation.

- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Committable suggestion

Suggested change
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and accelerates the training process.
Tools
LanguageTool

[style] ~93-~93: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)


3. **Event Aggregation**:

- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.

Consider replacing "lots of" with "many" to enhance formality and clarity.

The phrase "lots of" might be considered informal. Using "many" can improve the formality and clarity of the documentation.

- ...ndow aggregations on datasets that have lots of concurrent observations.
+ ...ndow aggregations on datasets that have many concurrent observations.
Committable suggestion

Suggested change
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have lots of concurrent observations.
- Events that occur on the same date for the same patient are aggregated. This reduces redundancy in the data and significantly speeds up the rolling window aggregations on datasets that have many concurrent observations.
Tools
LanguageTool

[style] ~64-~64: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative.
Context: ...ndow aggregations on datasets that have lots of concurrent observations. 4. **Rolling ...

(A_LOT_OF)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c6cc059 and d1ac6b2.

Files ignored due to path filters (3)
  • docs/assets/dark_purple_meds_tab.png is excluded by !**/*.png
  • docs/assets/light_purple_meds_tab.png is excluded by !**/*.png
  • docs/assets/white_meds_tab.png is excluded by !**/*.png
Files selected for processing (7)
  • docs/generate.sh (1 hunks)
  • docs/source/_static/switcher.json (1 hunks)
  • docs/source/conf.py (2 hunks)
  • docs/source/implementation.md (1 hunks)
  • docs/source/index.md (1 hunks)
  • docs/source/overview.md (1 hunks)
  • src/MEDS_tabular_automl/__init__.py (1 hunks)
Files skipped from review due to trivial changes (2)
  • docs/generate.sh
  • src/MEDS_tabular_automl/__init__.py
Additional context used
Biome
docs/source/_static/switcher.json

[error] 12-12: Expected an array, an object, or a literal but instead found ']'.

Expected an array, an object, or a literal here.

(parse)

LanguageTool
docs/source/index.md

[style] ~28-~28: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...

(IN_ORDER_TO_PREMIUM)


[style] ~39-~39: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

(NOT_ONLY_ALSO)


[style] ~39-~39: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 2513 characters long)
Context: ... datasets in reasonable raw formulations!

(EN_EXCESSIVE_EXCLAMATION)

docs/source/implementation.md

[style] ~92-~92: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

docs/source/overview.md

[uncategorized] ~34-~34: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)


[uncategorized] ~57-~57: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~72-~72: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)


[grammar] ~74-~74: Possible subject-verb agreement error detected.
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

(PLURAL_THAT_AGREEMENT)


[uncategorized] ~85-~85: Loose punctuation mark.
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~100-~100: Loose punctuation mark.
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)


[uncategorized] ~125-~125: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~135-~135: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

Markdownlint
docs/source/overview.md

12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


18-18: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Additional comments not posted (14)
docs/source/index.md (4)

3-3: Enhance the introductory sentence for greater impact.

The current introduction could be more engaging. Consider emphasizing the unique features of MEDS-Tab early in the description.

- MEDS-Tab is a library designed for automated tabularization, data preparation with aggregation, and time windowing. Check out below for an overview of MEDS-Tab and how it could be useful in your workflows!
+ MEDS-Tab revolutionizes medical data handling by automating tabularization, data preparation, and time windowing. Discover how MEDS-Tab can transform your medical data workflows below!

23-24: Enhance the bullet points for clarity and specificity.

The bullet points under "Why MEDS-Tab?" could benefit from more specific examples or metrics to substantiate the claims of cost efficiency and performance.

- - Cost Efficiency: MEDS-Tab is dramatically more cost-effective compared to existing solutions
- - Strong Performance: MEDS-Tab provides robustness and high performance across various datasets compared with other frameworks.
+ - Cost Efficiency: MEDS-Tab reduces operational costs by up to 50% compared to traditional data handling solutions.
+ - Strong Performance: In benchmarks, MEDS-Tab consistently outperforms other frameworks, enhancing predictive accuracy by 20%.

28-28: Streamline the wording for clarity.

The phrase "In order to" can be simplified to make the sentence more direct and concise.

- In order to use MEDS-Tab, you will first need to transform your raw EHR data to a MEDS format, which can be done using the following libraries:
+ To use MEDS-Tab, first transform your raw EHR data to a MEDS format using the following libraries:
Tools
LanguageTool

[style] ~28-~28: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...

(IN_ORDER_TO_PREMIUM)


39-39: Improve the sentence structure for enhanced readability.

The phrase "not only... but also" can be simplified to improve the flow of the text.

- This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning tasks for health projects.
+ This approach simplifies the process and ensures high-quality, reproducible results for your machine learning tasks in health projects.
Tools
LanguageTool

[style] ~39-~39: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

(NOT_ONLY_ALSO)


[style] ~39-~39: Using many exclamation marks might seem excessive (in this case: 5 exclamation marks for a text that’s 2513 characters long)
Context: ... datasets in reasonable raw formulations!

(EN_EXCESSIVE_EXCLAMATION)

docs/source/implementation.md (4)

1-3: Consider adding a comma for better readability.

The introductory sentence could benefit from a comma after "In this section" to improve the flow and readability.

- In this section we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.
+ In this section, we describe the MEDS-Tab architecture, specifically some of the pipeline choices we made to reduce memory usage and increase speed during the tabularization process and XGBoost tuning process.

14-14: Consider adding a comma after 'as follows'.

This would clarify the separation between the introductory clause and the detailed description that follows.

- This initial stage processes a pre-shareded dataset. We expect a structure as follows where each shard contains a subset of the patients:
+ This initial stage processes a pre-sharded dataset. We expect a structure as follows, where each shard contains a subset of the patients:

71-71: Insert "a" before "Sparse array" to correct the determiner omission.

A determiner appears to be missing. Consider inserting it.

- Sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.
+ A sparse array is converted to Coordinate List format and stored as a `.npz` file on disk.

92-92: Consider using a stronger verb choice.

Replacing "speeds up" with "accelerates" might provide a stronger and more formal expression in the documentation.

- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Tools
LanguageTool

[style] ~92-~92: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)

docs/source/overview.md (5)

1-2: Clarify the repository's purpose in the introduction.

The introduction could be expanded to provide more details about the specific capabilities and advantages of the repository.


12-12: Convert emphasized text to headings for clarity and consistency.

Using headings instead of bold text for section titles can improve the structure and readability of the document.

-**Pip Install**
+### Pip Install

-**Local Install**
+### Local Install
Tools
Markdownlint

12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


18-18: Convert emphasis to heading for clarity and consistency.

The emphasis on "Local Install" should be converted to a heading to maintain consistency and improve document structure.

- **Local Install**
+ ### Local Install
Tools
Markdownlint

18-18: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


43-43: Clarify conjunction usage in sentence.

The use of "thus" in the description of meds-tab-tabularize-static might confuse readers as it suggests a cause-effect relationship not clearly stated.

- thus rows are duplicated across multiple timestamps for the same patient.
+ Consequently, rows are duplicated across multiple timestamps for the same patient.
Tools
LanguageTool

[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~43-~43: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)


102-102: Add a comma after introductory phrase for clarity.

The phrase "For example" should be followed by a comma to improve readability.

- For example you can directly call **`generate-permutations`**
+ For example, you can directly call **`generate-permutations`**
Tools
LanguageTool

[typographical] ~102-~102: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)

docs/source/conf.py (1)

24-24: Ensure version_match is correctly set.

The environment variable READTHEDOCS_VERSION might not always be set, leading to potential issues. Ensure there is a fallback mechanism.

- version_match = os.environ.get("READTHEDOCS_VERSION")
+ version_match = os.environ.get("READTHEDOCS_VERSION", "dev")
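
A minimal sketch of such a fallback, assuming "dev" is an acceptable default label (it is taken from the suggestion above, not from Read the Docs itself); `resolve_version_match` is a hypothetical helper name:

```python
import os


def resolve_version_match(default: str = "dev") -> str:
    """Return the Read the Docs version label, or a default for local builds."""
    # READTHEDOCS_VERSION may be unset (or empty) outside a hosted build,
    # so `or default` covers both cases.
    return os.environ.get("READTHEDOCS_VERSION") or default


version_match = resolve_version_match()
```

Using `or default` rather than only the second argument of `os.environ.get` also handles the case where the variable is set but empty.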

"url": "https://pydata-sphinx-theme.readthedocs.io/en/stable/",
"preferred": true
},
]

Fix JSON syntax error.

The last object in the JSON array is followed by a trailing comma, which strict JSON does not allow; this is what the Biome parse error below reports. Remove the comma after the final closing brace.

-  },
-  ]
+  }
+  ]
Committable suggestion

Suggested change
},
]
}
]
Tools
Biome

[error] 12-12: Expected an array, an object, or a literal but instead found ']'.

Expected an array, an object, or a literal here.

(parse)
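
The Biome error above is the kind of failure a strict JSON parser reports when a comma precedes the closing bracket, as in the `},` / `]` lines of the hunk. A minimal sketch with Python's standard `json` module follows; the sample entry reuses the `url` and `preferred` keys from the hunk, and the helper name `parses` is hypothetical.

```python
import json

# Same shape as the hunk: the last object is followed by a comma.
with_trailing_comma = """
[
  {"url": "https://pydata-sphinx-theme.readthedocs.io/en/stable/", "preferred": true},
]
"""

# Identical content with the trailing comma removed.
without_trailing_comma = """
[
  {"url": "https://pydata-sphinx-theme.readthedocs.io/en/stable/", "preferred": true}
]
"""


def parses(text: str) -> bool:
    """Return True when `text` is valid JSON according to the json module."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print(parses(with_trailing_comma))
    print(parses(without_trailing_comma))
```

Running a check like this against switcher.json before committing would catch the problem independently of the bracket's indentation.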

@@ -1,5 +1,9 @@
import os
import shutil

Remove unused import.

The shutil module is imported but not used in the script.

-import shutil
Committable suggestion

Suggested change
import shutil
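
Whether `shutil` is really unused depends on the rest of conf.py, which is not shown in this hunk. A rough way to check is sketched below with the standard `ast` module. The function name `unused_imports` is hypothetical, and the check is deliberately simplified: it only looks at bare name references, so it ignores names used inside strings or re-exported via `__all__`.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported top-level names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a as x" binds "x".
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Every bare name that appears anywhere in the module body.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


if __name__ == "__main__":
    example = "import os\nimport shutil\nprint(os.getcwd())\n"
    print(unused_imports(example))  # prints ['shutil']
```

A linter such as the one already producing these review hints is the more robust option; the sketch only illustrates what "unused" means here.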

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d1ac6b2 and 0d148d7.

Files selected for processing (2)
  • docs/source/_static/switcher.json (1 hunks)
  • docs/source/conf.py (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • docs/source/_static/switcher.json
Additional comments not posted (32)
docs/source/conf.py (32)

4-4: Approved: Importing Path from pathlib.

The Path module is used for path operations in the script.


17-18: Approved: Updated project metadata.

The project metadata, including project, copyright, and author, has been updated appropriately.


22-22: Approved: Added json_url variable.

The json_url variable is defined for the version switcher.


24-24: Approved: Added version_match variable.

The version_match variable is defined for version handling.


25-25: Approved: Set release variable.

The release variable is set to the version of MEDS_tabular_automl.


29-41: Approved: Logic for handling version_match.

The script includes logic to handle different values of version_match.


49-49: Approved: Set language to "en".

The language for the documentation is set to English.


52-53: Approved: Defined __location__ and __src__.

The __location__ and __src__ variables are defined using Path.


58-58: Approved: Added source directory to sys.path.

The script adds the source directory to sys.path.


61-76: Approved: Defined ensure_pandoc_installed function.

The function ensure_pandoc_installed ensures Pandoc is installed.


79-96: Approved: Script for running sphinx-apidoc.

The script runs sphinx-apidoc automatically.


108-119: Approved: Added Sphinx extensions.

The list of Sphinx extensions includes several new extensions.


123-123: Approved: Set HTML theme to pydata_sphinx_theme.

The HTML theme is set to pydata_sphinx_theme.


125-129: Approved: Updated sidebar configuration.

The sidebar configuration for the HTML theme is updated.


131-131: Approved: Set nbsphinx_allow_errors to True.

The nbsphinx_allow_errors variable is set to True.


134-136: Approved: Defined and created collections_dir.

The collections_dir is defined and created if it does not exist.


138-138: Approved: Defined python_version variable.

The python_version variable is defined for intersphinx mapping.


140-153: Approved: Updated intersphinx mapping.

The intersphinx mapping includes several new entries.


162-162: Approved: Updated exclude_patterns variable.

The exclude_patterns variable is updated.


165-165: Approved: Set nb_execution_mode to "off".

The nb_execution_mode variable is set to "off".


171-182: Approved: Updated myst_enable_extensions variable.

The myst_enable_extensions variable includes several new extensions.


184-184: Approved: Set myst_update_mathjax to True.

The myst_update_mathjax variable is set to True.


187-199: Approved: Updated myst_url_schemes variable.

The myst_url_schemes variable includes several new URL schemes.


202-202: Approved: Updated source_suffix variable.

The source_suffix variable is updated.


208-208: Approved: Set master_doc to "index".

The master_doc variable is set to "index".


225-225: Approved: Set pygments_style to "tango".

The pygments_style variable is set to "tango".


235-235: Approved: Set todo_emit_warnings to True.

The todo_emit_warnings variable is set to True.


243-244: Approved: Set html_title and html_short_title.

The html_title and html_short_title variables are set.


249-287: Approved: Updated html_theme_options variable.

The html_theme_options variable includes several new options.


289-293: Approved: Defined html_context variable.

The html_context variable is defined.


328-329: Approved: Set html_show_copyright and htmlhelp_basename.

The html_show_copyright and htmlhelp_basename variables are set.


2-2: Remove unused import.

The shutil module is imported but not used in the script.

-import shutil

Likely invalid or redundant comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Outside diff range and nitpick comments (6)
docs/source/profiling.md (5)

7-7: Add a comma for better readability.

The sentence could benefit from a comma after "comparison" to improve the flow and readability.

- Specifically, for comparison we used:
+ Specifically, for comparison, we used:

24-24: Remove "of" for conciseness.

The phrase "in all of the scenarios" can be simplified to "in all scenarios" to enhance readability and conciseness.

- ...emphasizing the better performance of MEDS-Tab in all of the scenarios.
+ ...emphasizing the better performance of MEDS-Tab in all scenarios.
Tools
LanguageTool

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)


24-24: Correct hyphenation in adjectival compound.

When a number forms part of an adjectival compound, it should be hyphenated to improve readability.

- Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.

20-20: Consider adding a comma for better readability.

The sentence could benefit from a comma after "datasets" to improve the flow and readability.

- ...w that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab.
+ ...w that on the MIMIC-IV and eICU medical datasets, we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab.

84-84: Ensure the file ends with a single newline character.

Files should end with a single newline character to adhere to best practices.

+ \n
docs/source/prediction.md (1)

85-85: Specify language for fenced code blocks to adhere to Markdown best practices.

Fenced code blocks should have a language specified.

- ```
+ ```bash
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0d148d7 and 11e4623.

Files selected for processing (2)
  • docs/source/prediction.md (1 hunks)
  • docs/source/profiling.md (1 hunks)
Additional context used
LanguageTool
docs/source/profiling.md

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

docs/source/prediction.md

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)


[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

(SPECIFIC_HYPHEN)

Markdownlint
docs/source/prediction.md

121-121: null
Spaces inside code span elements

(MD038, no-space-in-code)

tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```

Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over.

Remove spaces inside code span elements.

Spaces inside code span elements should be removed to adhere to best practices.

- ` tabularization.aggs`
+ `tabularization.aggs`
Committable suggestion

Suggested change
Note that the XGBoost command shown includes `tabularization.window_sizes` and ` tabularization.aggs` in the parameters to sweep over.
Note that the XGBoost command shown includes `tabularization.window_sizes` and `tabularization.aggs` in the parameters to sweep over.
Tools
Markdownlint

121-121: null
Spaces inside code span elements

(MD038, no-space-in-code)

| LOS in Hospital > 3 days | Admission + 24 hr | 6m5s | 7m5s | 1m4s | 11,012 | 12,223 |
| LOS in Hospital > 3 days | Admission + 48 hr | 6m10s | 7m12s | 1m4s | 10,703 | 11,830 |

#### 1.2 MIMIC-IV Task Specific Training Cohort Size

Hyphenate "Task-Specific" for grammatical correctness.

When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.

- #### 1.2 MIMIC-IV Task Specific Training Cohort Size
+ #### 1.2 MIMIC-IV Task-Specific Training Cohort Size
Committable suggestion

Suggested change
#### 1.2 MIMIC-IV Task Specific Training Cohort Size
#### 1.2 MIMIC-IV Task-Specific Training Cohort Size
Tools
LanguageTool

[uncategorized] ~63-~63: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)

| Post-discharge 30 day Mortality | Discharge | 0.003 | 0.0116 | 0.001 | 0.730 | 13 | 986 | 7 | 7 |
| Post-discharge 1 year Mortality | Discharge | 0.005 | 0.006 | 0.002 | 0.690 | 93 | 938 | 6 | 14 |

#### 3. eICU Task Specific Training Cohort Size

Hyphenate "Task-Specific" for grammatical correctness.

When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.

- #### 3. eICU Task Specific Training Cohort Size
+ #### 3. eICU Task-Specific Training Cohort Size
Committable suggestion

Suggested change
#### 3. eICU Task Specific Training Cohort Size
#### 3. eICU Task-Specific Training Cohort Size
Tools
LanguageTool

[uncategorized] ~195-~195: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

(SPECIFIC_HYPHEN)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 19

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 11e4623 and 7fe635a.

Files selected for processing (2)
  • README.md (2 hunks)
  • docs/source/profiling.md (1 hunks)
Files not summarized due to errors (1)
  • README.md: Error: Message exceeds token limit
Additional context used
LanguageTool
docs/source/profiling.md

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

README.md

[style] ~61-~61: Consider a shorter alternative to avoid wordiness.
Context: ...ed across arbitrary tasks and settings. In order to use MEDS-Tab, you will first need to tr...

(IN_ORDER_TO_PREMIUM)


[style] ~72-~72: Opting for a less wordy alternative here can improve the clarity of your writing.
Context: ...n the MEDS-Tab ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for ...

(NOT_ONLY_ALSO)


[uncategorized] ~76-~76: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~85-~85: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~85-~85: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique patient_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)


[uncategorized] ~99-~99: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~114-~114: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)


[grammar] ~116-~116: Possible subject-verb agreement error detected.
Context: ...a specific task $TASK and labels that has pulled from [ACES](https://github.com/j...

(PLURAL_THAT_AGREEMENT)


[uncategorized] ~127-~127: Loose punctuation mark.
Context: ...e/max] ``` 5. meds-tab-xgboost: Trains an XGBoost model using user-spec...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~142-~142: Loose punctuation mark.
Context: ... Scripts 1. generate-permutations: Generates and prints a sorted list of a...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~144-~144: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)


[uncategorized] ~167-~167: Possible missing comma found.
Context: .... ## Roadmap MEDS-Tab has several key limitations which we plan to address in future chan...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~177-~177: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...aggregations and/or window sizes we use prior to passing them into the models as feature...

(EN_WORDINESS_PREMIUM_PRIOR_TO)


[uncategorized] ~220-~220: Possible missing comma found.
Context: ...reded dataset. We expect a structure as follows where each shard contains a subset of t...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~298-~298: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)


[uncategorized] ~334-~334: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~338-~338: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~338-~338: Possible missing comma found.
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~338-~338: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)


[uncategorized] ~462-~462: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... 11,830 | #### 1.2 MIMIC-IV Task Specific Training Cohort Size To better underst...

(SPECIFIC_HYPHEN)


[uncategorized] ~594-~594: When ‘Task-Specific’ is used as a modifier, it is usually spelled with a hyphen.
Context: ... | 14 | #### 3. eICU Task Specific Training Cohort Size | Task ...

(SPECIFIC_HYPHEN)

Markdownlint
README.md

21-21: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


32-32: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


38-38: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


207-207: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


315-315: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


400-400: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


520-520: null
Spaces inside code span elements

(MD038, no-space-in-code)

Additional comments not posted (4)
docs/source/profiling.md (2)

20-20: Correct the spelling error.

The word "depecited" should be corrected to "depicted."

- Tabularization of time-series data, as depecited above, is commonly used in several past works.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works.

Likely invalid or redundant comment.


24-24: Remove "of" after "all" for conciseness.

The phrase "in all of the scenarios" can be simplified to "in all scenarios" to enhance readability and conciseness.

- ...ing the better performance of MEDS-Tab in all of the scenarios.
+ ...ing the better performance of MEDS-Tab in all scenarios.

Likely invalid or redundant comment.

Tools
LanguageTool

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

README.md (2)

52-58: LGTM!

The "Why MEDS-Tab?" section is clear and informative.


400-400: LGTM!

The "Prediction Performance" section is clear and informative.

Tools
Markdownlint

400-400: null
Multiple top-level headings in the same document

(MD025, single-title, single-h1)


### MEDS-Tab Tabularization Technique

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

Add a comma for better readability.

Consider adding a comma after "method" for better readability.

- ...more memory efficient version of their method which we denote `catabra-mem`.
+ ...more memory efficient version of their method, which we denote `catabra-mem`.
Committable suggestion

Suggested change
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.
Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method, which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.


## 2. Comparative Performance Analysis

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.

Correct hyphenation in adjectival compound.

When a number forms part of an adjectival compound, it should be hyphenated to improve readability.

- Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
Committable suggestion

Suggested change
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
Tools
LanguageTool

[style] ~24-~24: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~24-~24: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)

1. Construction and efficient loading of tabular (flat, non-longitudinal) summary features describing patient records in MEDS over arbitrary time windows (e.g. 1 year, 6 months, etc.), which go backwards in time from a given index date.
2. Running a basic XGBoost AutoML pipeline over these tabular features to predict arbitrary binary classification or regression downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced -- what is more advanced is the efficient construction, storage, and loading of tabular features for the candidate AutoML models, enabling a far more extensive search over a much larger total number of features than prior systems.

## Quick Start

Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Pip Install**
+ ### Pip Install

Committable suggestion was skipped due to low confidence.


To use MEDS-Tab, install the dependencies following commands below:

**Pip Install**

```bash
```console
pip install meds-tab
```

**Local Install**

Use headings instead of emphasis for installation methods.

Replace emphasis with proper headings to adhere to markdown standards.

- **Local Install**
+ ### Local Install
Committable suggestion

Suggested change
**Local Install**
### Local Install
Tools
Markdownlint

38-38: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

6. **`meds-tab-xgboost-sweep`**: Conducts an Optuna hyperparameter sweep to optimize over `window_sizes`, `aggregations`, and `min_code_inclusion_frequency`, aiming to enhance model performance and adaptability.

### Additional CLI Scripts
## Additional CLI Scripts

1. **`generate-permutations`**: Generates and prints a sorted list of all permutations from a comma separated input. This is provided for the convenience of sweeping over all possible combinations of window sizes and aggregations.

For example you can directly call **`generate-permutations`** in the command line:

Add a comma after 'For example'.

Improve readability by adding a comma.

- For example you can directly call **`generate-permutations`** in the command line:
+ For example, you can directly call **`generate-permutations`** in the command line:
Committable suggestion

Suggested change
For example you can directly call **`generate-permutations`** in the command line:
For example, you can directly call **`generate-permutations`** in the command line:
Tools
LanguageTool

[typographical] ~144-~144: After the expression ‘for example’ a comma is usually used.
Context: ... window sizes and aggregations. For example you can directly call **`generate-permu...

(COMMA_FOR_EXAMPLE)
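
The README text quoted in this hunk describes `generate-permutations` as printing a sorted list of all permutations of a comma-separated input, for sweeping over combinations of window sizes and aggregations. The sketch below is one possible reading of that description, not the script's actual implementation: it assumes "all permutations" means every non-empty subset of the input values. The function name `non_empty_subsets` and the sample values are hypothetical.

```python
from itertools import combinations


def non_empty_subsets(csv_values: str) -> list[list[str]]:
    """Return every non-empty subset of a comma-separated input, sorted."""
    # Split on commas, trim whitespace, and drop empty entries.
    items = sorted(value.strip() for value in csv_values.split(",") if value.strip())
    subsets = [
        list(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
    # Lists compare lexicographically, which gives a stable ordering.
    return sorted(subsets)


if __name__ == "__main__":
    for subset in non_empty_subsets("a,b,c"):
        print(subset)
```

With n input values this yields 2^n - 1 subsets, so the sweep grows quickly as more window sizes or aggregations are added.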


**Detailed Workflow:**

- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.

Use a stronger verb for clarity.

Replace "speeds up" with "accelerates" for stronger wording.

- This reduces the memory footprint and speeds up the training process.
+ This reduces the memory footprint and accelerates the training process.
Committable suggestion

Suggested change
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and accelerates the training process.
Tools
LanguageTool

[style] ~298-~298: Consider using a different verb to strengthen your wording.
Context: .... This reduces the memory footprint and speeds up the training process. - **Use of Sparse...

(SPEED_UP_ACCELERATE)


## 2. Comparative Performance Analysis

The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.

Fix grammatical issues and improve clarity.

Address missing commas and improve descriptions for better readability.

- The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all of the scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such small number of patients (10, 100, and 500 patients) data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10 minute budget.
+ The tables below detail computational resource utilization across two datasets and various patient scales, emphasizing the better performance of MEDS-Tab in all scenarios. The tables are organized by dataset and number of patients. For the analysis, the full window sizes and the aggregation method code_count were used. Additionally, we use a budget of 10 minutes for running our tests given that for such a small number of patients (10, 100, and 500 patients), data should be processed quickly. Note that `catabra-mem` is omitted from the tables as it never completed within the 10-minute budget.
Tools
LanguageTool

[style] ~338-~338: Consider removing “of” to be more concise
Context: ...g the better performance of MEDS-Tab in all of the scenarios. The tables are organized by ...

(ALL_OF_THE)


[uncategorized] ~338-~338: Possible missing comma found.
Context: ... that catabra-mem is omitted from the tables as it never completed within the 10 min...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~338-~338: When a number forms part of an adjectival compound, use a hyphen.
Context: ...tables as it never completed within the 10 minute budget. ### eICU Dataset The only met...

(MISSING_HYPHEN)
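
The 10-minute budget described in the profiling paragraph above could be enforced with a small wrapper like the one below. This is only a sketch under stated assumptions, not the benchmarking harness used for the reported numbers: `run_with_budget` is a hypothetical helper name, and it measures wall-clock time only, not memory.

```python
import subprocess
import sys
import time


def run_with_budget(cmd, budget_seconds):
    """Run `cmd` and report whether it finished within `budget_seconds`.

    Hypothetical helper for illustration; returns (finished, elapsed_seconds).
    """
    start = time.perf_counter()
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # once the timeout elapses.
        subprocess.run(cmd, check=True, timeout=budget_seconds)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.perf_counter() - start


if __name__ == "__main__":
    # A trivial command stands in for a tabularization run here.
    finished, elapsed = run_with_budget([sys.executable, "-c", "pass"], 10 * 60)
    print(f"finished={finished} elapsed={elapsed:.2f}s")
```

A method that exceeds the budget (as reported for `catabra-mem`) would come back with `finished` set to `False` and can then be left out of the results table.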

downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced --
what is more advanced is the efficient construction, storage, and loading of tabular features for the
candidate AutoML models, enabling a far more extensive search over different featurization strategies.
See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

Clarify the example description.

Improve the description for better readability.

- See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline being run on synthetic data. This script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.
+ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/tests/test_integration.py) for a local example of the end-to-end pipeline run on synthetic data. This script is a functional test executed with `pytest` to verify the algorithm's correctness.

Comment on lines +74 to +83
## Core CLI Scripts Overview

1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:

- time-series codes (codes with timestamps)
- time-series numerical values (codes with timestamps and numerical values)
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

**Caching feature names and frequencies** in a dataset stored in `"path_to_data"`

```
meds-tab-describe MEDS_cohort_dir="path_to_data"
```
This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.

Fix grammatical issues and improve clarity.

Address loose punctuation marks and improve descriptions for better readability.

- 1. **`meds-tab-describe`**: This command processes MEDS data shards to compute the frequencies of different code-types. It differentiates codes into the following categories:
+ 1. **`meds-tab-describe`**: Computes the frequencies of different code-types in MEDS data shards, categorizing them as:
Tools
LanguageTool

[uncategorized] ~76-~76: Loose punctuation mark.
Context: ...pts Overview 1. meds-tab-describe: This command processes MEDS data shards...

(UNLIKELY_OPENING_PUNCTUATION)
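
The four code categories listed in the `meds-tab-describe` description above can be illustrated with a small, pure-Python sketch. It is not the MEDS-Tab implementation: the column names `timestamp`, `numerical_value`, and `code`, and the helpers `categorize` and `code_frequencies`, are assumptions made for illustration, and each row is simplified to fall into exactly one category.

```python
from collections import Counter


def categorize(row):
    """Assign a row to one of the four categories from the description.

    Assumes (hypothetically) that a missing timestamp marks a static code
    and that a non-null `numerical_value` marks a numerical code.
    """
    has_time = row.get("timestamp") is not None
    has_value = row.get("numerical_value") is not None
    if has_time:
        return "time-series numerical value" if has_value else "time-series code"
    return "static numerical code" if has_value else "static code"


def code_frequencies(rows):
    """Count how often each (category, code) pair occurs."""
    return Counter((categorize(row), row["code"]) for row in rows)


if __name__ == "__main__":
    # Made-up rows, purely for demonstration.
    rows = [
        {"code": "SEX//F", "timestamp": None, "numerical_value": None},
        {"code": "HEIGHT", "timestamp": None, "numerical_value": 170.0},
        {"code": "ADMIT", "timestamp": "2020-01-01", "numerical_value": None},
        {"code": "HR", "timestamp": "2020-01-01", "numerical_value": 88.0},
        {"code": "HR", "timestamp": "2020-01-02", "numerical_value": 91.0},
    ]
    for key, count in code_frequencies(rows).items():
        print(key, count)
```

Counts like these are the kind of per-code frequencies the description says are cached in `code_metadata.parquet`; the actual file layout is not shown here.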


### MEDS-Tab Tabularization Technique

Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`. Other libraries either provide only rolling window functionalities (`featuretools`) or just pivoting operations (`Temporai`/`Clairvoyance`, `sktime`, `AutoTS`). We provide a significantly faster and more memory efficient method. Our findings show that on the MIMIC-IV and eICU medical datasets we significantly outperform both above-mentioned methods that provide similar functionalities with MEDS-Tab. While `catabra` and `tsfresh` could not even run within a budget of 10 minutes on as low as 10 patient's data for eICU, our method scales to process hundreds of patients with low memory usage under the same time budget. We present the results below.

Fix grammatical issues and improve clarity.

Address missing commas and improve descriptions for better readability.

- Tabularization of time-series data, as depecited above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory efficient version of their method which we denote `catabra-mem`.
+ Tabularization of time-series data, as depicted above, is commonly used in several past works. The only two libraries to our knowledge that provide a full tabularization pipeline are `tsfresh` and `catabra`. `catabra` also offers a slower but more memory-efficient version of their method, which we denote `catabra-mem`.
Tools
LanguageTool

[uncategorized] ~334-~334: Possible missing comma found.
Context: ... more memory efficient version of their method which we denote catabra-mem. Other li...

(AI_HYDRA_LEO_MISSING_COMMA)

@Oufattole Oufattole merged commit 9f4dde8 into main Jul 3, 2024
3 checks passed
@mmcdermott mmcdermott deleted the docs branch August 10, 2024 17:27
@coderabbitai coderabbitai bot mentioned this pull request Nov 6, 2024
Merged
6 participants