
Converting ESGPT caching to work for MEDS datasets #1

Merged: 75 commits, Jun 5, 2024
Conversation

@Oufattole (Collaborator) commented May 26, 2024

Pull Request Summary

This pull request details the modifications and enhancements made to the data processing and aggregation pipeline. Below is the list of tasks that have been implemented and those that are still in progress.

Completed Tasks

  • Split Subjects into Shards: Subjects have been divided into manageable shards for efficient processing.
  • Store Parameters for the Job: Configuration parameters, including patient partitioning and other settings, are saved to a JSON file. This ensures that the setup for each job, such as the number of patients per sub-shard and the minimum code inclusion frequency, is consistently maintained and easily retrievable (a minimal sketch of this step appears after this list).
  • Feature Column Processing (get_flat_rep_feature_cols): This function processes aggregations and selects which columns receive which aggregations. The process is categorized into two main types:
    • Static:
      1. Codes without numerical values.
      2. Codes with numerical values.
    • Dynamic:
      1. Codes with applied aggregations across all instances.
      2. Numerical values with continuous aggregations.
  • Static Representations (get_flat_static_rep): Produces tabular representations of static data, i.e., features that do not change over time.
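As a rough illustration of the parameter-storage step above, the sketch below persists job parameters as JSON. The helper name store_params and the parameter values are hypothetical; the PR's actual implementation stores the Hydra config (a store_config_yaml function appears later in this review).

import json
from pathlib import Path

def store_params(output_dir: Path, params: dict) -> Path:
    """Persist job parameters as JSON so later pipeline stages can verify a consistent setup."""
    output_dir.mkdir(parents=True, exist_ok=True)
    params_fp = output_dir / "config.json"
    params_fp.write_text(json.dumps(params, indent=2))
    return params_fp

# Illustrative values, not taken from the PR.
store_params(Path("tabularized_data"), {
    "n_patients_per_sub_shard": 100,
    "min_code_inclusion_frequency": 10,
})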

Tasks in Progress

  • Time Series Data Representation (get_flat_ts_rep): This function is intended to iterate through each shard and produce a raw representation of the time series data (see the sketch after this list).
  • Summarize History: This task will apply the necessary aggregations to the historical data for further analysis.
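As a rough illustration of what this flat time-series representation might look like, the sketch below pivots long-format MEDS events into wide per-code presence and value columns. It is a minimal pandas sketch with assumed column names (patient_id, timestamp, code, numerical_value); the PR's actual implementation uses Polars and, later, sparse matrices.

import pandas as pd

# Long-format MEDS-style events (assumed schema).
events = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
    "code": ["A", "B", "B"],
    "numerical_value": [1.0, None, 4.0],
})

# One binary presence column per code ("A/code", "B/code", ...).
codes = pd.get_dummies(events["code"]).astype(int).add_suffix("/code")

# One value column per code, zero where the code was absent or carried no value.
values = (
    events.pivot_table(index=events.index, columns="code", values="numerical_value", aggfunc="first")
    .reindex(events.index)
    .fillna(0)
    .add_suffix("/value")
)

flat_ts = pd.concat([events[["patient_id", "timestamp"]], codes, values], axis=1)
print(flat_ts)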

Summary by CodeRabbit

  • New Features

    • Introduced functions to generate summarized representations of data frames based on specified window sizes and aggregations.
    • Automated tabularization process for medical data, including identifying columns, tabularizing static data, and summarizing data over windows.
  • Chores

    • Updated Python version setup in GitHub workflows from 3.11 to 3.12.

@coderabbitai (Contributor) bot commented May 26, 2024

Warning: rate limit exceeded

@mmcdermott has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 12 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Commits

Files that changed from the base of the PR and between e7a85ba and c8f4144.

Walkthrough

The recent updates encompass a wide range of enhancements and additions to the MEDS tabular automation system. Key changes include the introduction of new configuration settings, the addition of several Python packages, updates to shell scripts for data processing, and new functions for summarizing and generating features from medical datasets. The updates also include improvements to the testing setup and the README documentation, ensuring a comprehensive and functional pipeline for handling and analyzing medical data.

Changes

  • configs/tabularize.yaml: Added new settings, updated existing ones, and removed certain values to improve configuration flexibility.
  • pyproject.toml: Added several Python packages, including scipy, pandas, and numba.
  • scripts/hf_cohort_e2e.sh, scripts/hf_cohort_shard.sh: Updated scripts with new parameters and processing steps for handling medical cohort data.
  • scripts/summarize_over_windows.py: Introduced a function to manage time-series data summarization.
  • src/MEDS_tabular_automl/...: Added and updated multiple functions across several modules for generating static and time-series features, summarizing data, and handling utilities.
  • .github/workflows/tests.yaml: Updated the Python version setup from 3.11 to 3.12.
  • .pre-commit-config.yaml: Added the arguments --in-place and --remove-all-unused-imports to the autoflake hook.
  • README.md: Updated to clarify the repository's focus, provide installation and usage instructions, and list future enhancements.
  • scripts/identify_columns.py: Automated the creation of flat representations of medical datasets, including functions for storing configuration parameters and writing summarized dataset representations.
  • tests/test_tabularize.py: Introduced test cases for tabularizing medical data using Polars and Hydra, including caching flat representations and summarizing data.
  • hf_cohort/...: Added new scripts and configuration files for processing medical cohort data, including task setup, data extraction, and configuration settings.
  • src/MEDS_tabular_automl/file_name.py: Introduced helper functions for managing file names and paths for MEDS tabular AutoML tasks.

Sequence Diagram(s) (Beta)

sequenceDiagram
    participant User
    participant Script
    participant Config
    participant DataProcessor
    participant Summarizer
    participant Storage

    User->>Script: Run hf_cohort_e2e.sh
    Script->>Config: Load configurations
    Script->>DataProcessor: Process medical data
    DataProcessor->>Summarizer: Summarize data over windows
    Summarizer->>Storage: Save summarized data
    Storage-->>User: Data processing complete

Poem

In the land of data, where numbers play,
A rabbit hops in, brightening the day.
With scripts and configs, it weaves a tale,
Of medical insights, where details prevail.
From static to dynamic, features align,
In the world of MEDS, the data shines.
🐇✨📊



@Oufattole requested a review from mmcdermott on May 26, 2024 at 19:54
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 3

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 8f89008 and 19f0f4e.
Files selected for processing (6)
  • configs/tabularize.yaml (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/tabularize.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Additional Context Used

Ruff (8), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14).
Additional comments not posted (14)
configs/tabularize.yaml (2)

23-24: Added new configuration settings for dynamic and numerical value thresholds.

These settings are crucial for defining the behavior of dynamic and numerical value processing in the dataset. Ensure that these values are well-documented and tested to confirm they meet the intended thresholds.


31-31: Enabled the do_update setting.

This change allows the system to update existing data rather than overwriting it, which can be beneficial for incremental updates. Ensure that this behavior is consistent with the system's requirements and that it doesn't lead to data integrity issues.

src/MEDS_tabular_automl/generate_static_features.py (2)

20-84: Added function _summarize_static_measurements to aggregate static measurements.

This function efficiently aggregates static measurements and pivots them into a tabular format. It's well-documented and uses list comprehensions and filtering effectively. Ensure that the feature columns used are correctly prefixed and that the pivot operation is optimized as suggested in the TODO comment.
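For intuition, here is a minimal sketch of the pivot step described above, using pandas with assumed column names (patient_id, code); the PR's _summarize_static_measurements operates on Polars frames and prefixed feature columns.

import pandas as pd

# Static measurements in long format (assumed schema).
static = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "code": ["SEX//female", "EYE_COLOR//brown", "SEX//male"],
})

# Pivot so each row is a patient and each column a binary static feature.
static_rep = (
    static.assign(present=1)
    .pivot_table(index="patient_id", columns="code", values="present", aggfunc="max", fill_value=0)
    .add_suffix("/present")
)
print(static_rep)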


87-112: Added function get_flat_static_rep to produce a tabular representation of static data features.

This function wraps the _summarize_static_measurements function and normalizes the data, which is crucial for ensuring data consistency before further processing. The function is well-structured and documented. Ensure that the normalization process handles all edge cases appropriately.

tests/test_tabularize.py (1)

101-138: Added a comprehensive test function test_tabularize for the tabularization process.

This test function is well-structured and covers the creation of directories, storing of data, and the tabularization process. It uses temporary directories to avoid side effects, which is a good practice. Ensure that all edge cases and possible errors during the tabularization process are handled and tested.

src/MEDS_tabular_automl/generate_ts_features.py (3)

19-131: Added function _summarize_dynamic_measurements to generate dynamic measurements.

This function handles the inclusion of subjects dynamically and aggregates measurements based on configuration settings. It's well-documented and uses advanced Polars operations effectively. Ensure that the configuration settings (measurement_configs) are robust and cover all necessary scenarios.


134-214: Added function _summarize_over_window to apply aggregations over a specified window.

This function provides flexibility in handling different window sizes and applies various aggregations based on the window size. It's a complex function that handles many edge cases, and it's well-documented. Ensure that the performance of this function is monitored, especially with large datasets.
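As a sketch of window-based aggregation of this kind (pandas shown for brevity, with assumed column names; the PR uses Polars and, later, sparse matrices):

import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-03", "2021-01-04"]),
    "A/code": [1, 1, 0, 0],
    "A/value": [1.0, 2.0, 0.0, 0.0],
})

# For each patient, aggregate over a trailing 1-day window ending at each event.
rolled = (
    df.set_index("timestamp")
    .groupby("patient_id")[["A/code", "A/value"]]
    .rolling("1d")
    .sum()
    .rename(columns=lambda c: f"1d/{c}/sum")
)
print(rolled)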


217-229: Added function get_flat_ts_rep to produce a raw representation for dynamic data.

This function integrates the dynamic measurements summarization and normalization into a single workflow. It's concise and leverages lazy evaluation effectively. Ensure that the sort and collect operations are necessary and do not introduce performance bottlenecks.

src/MEDS_tabular_automl/utils.py (6)

27-27: Defined DF_T as a new data type for internal dataframes.

Defining a specific type for dataframes (DF_T) improves code readability and maintainability by making the expected data type clear throughout the codebase.


38-51: Added function write_df to write dataframes to disk.

This function provides a robust way to write dataframes to disk with an option to overwrite existing files. It handles file paths and directories effectively and uses the Polars library for writing data. Ensure that error handling is robust, especially for file writing operations.
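A minimal sketch of such a writer, assuming Polars and a do_overwrite flag mirroring the description above (illustrative, not the PR's exact signature):

from pathlib import Path

import polars as pl

def write_df(df: pl.DataFrame, fp: Path, do_overwrite: bool = False) -> None:
    """Write a dataframe to disk as Parquet, optionally overwriting an existing file."""
    if fp.exists() and not do_overwrite:
        raise FileExistsError(f"{fp} exists and do_overwrite is False")
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(fp)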


53-87: Added function get_smallest_valid_uint_type to determine the smallest valid unsigned integer type.

This function is crucial for optimizing memory usage by selecting the smallest possible data type that can still represent the data adequately. It's well-documented and includes comprehensive examples. Ensure that the function is used consistently throughout the codebase to maintain data type optimizations.
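A sketch of the idea behind get_smallest_valid_uint_type (assumed behavior: pick the narrowest unsigned integer dtype that can hold a given maximum value; the name smallest_uint_dtype is illustrative):

import numpy as np

def smallest_uint_dtype(max_value: int) -> np.dtype:
    """Return the narrowest unsigned integer dtype that can represent max_value."""
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError(f"{max_value} exceeds the uint64 range")

assert smallest_uint_dtype(255) == np.uint8
assert smallest_uint_dtype(70_000) == np.uint32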


108-149: Enhanced _normalize_flat_rep_df_cols to handle data normalization and type setting.

This function ensures that all necessary columns are added to a dataframe and that each column is set to the correct data type. It also handles zero counts, which can be important for certain types of data analysis. Ensure that this function is tested thoroughly to handle all edge cases.


152-213: Added evaluate_code_properties to categorize codes based on their properties.

This function categorizes codes dynamically based on configuration settings and is a key component of the feature generation process. It's well-documented and handles various edge cases. Ensure that the configuration settings are accurate and that the function's output is validated against expected results.


216-269: Added get_code_column and get_flat_rep_feature_cols to generate feature column names.

These functions are essential for generating feature columns based on the code properties. They are well-documented and use the CodeType enum effectively to ensure that the feature columns are generated correctly. Ensure that these functions are integrated properly with the rest of the feature generation pipeline.
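A rough sketch of how per-code feature column names of the kind used throughout this PR can be generated. The helper feature_columns_for is hypothetical; the real get_flat_rep_feature_cols is driven by code properties and the CodeType enum, while CODE_AGGREGATIONS and VALUE_AGGREGATIONS are names that appear in the PR's utils module.

CODE_AGGREGATIONS = ["code/count"]
VALUE_AGGREGATIONS = ["value/sum"]

def feature_columns_for(codes_with_values: dict[str, bool]) -> list[str]:
    """Build '<code>/<agg>' column names; value aggregations only for codes with numbers."""
    cols = []
    for code, has_value in codes_with_values.items():
        cols.extend(f"{code}/{agg}" for agg in CODE_AGGREGATIONS)
        if has_value:
            cols.extend(f"{code}/{agg}" for agg in VALUE_AGGREGATIONS)
    return sorted(cols)

print(feature_columns_for({"A": True, "B": False}))
# ['A/code/count', 'A/value/sum', 'B/code/count']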

(3 resolved review comments on src/MEDS_tabular_automl/tabularize.py, now outdated)
- Create individual scripts for:
  1. Retrieving column names
  2. Generating static representations
  3. Generating dynamic representations
  4. Summarizing data over windows
- Add doctests for retrieving column names and generating static representations
- Confirm functionality of the above tests
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 3

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 19f0f4e and fd1731f.
Files selected for processing (10)
  • .pre-commit-config.yaml (1 hunks)
  • scripts/identify_columns.py (1 hunks)
  • scripts/summarize_over_windows.py (1 hunks)
  • scripts/tabularize_static.py (1 hunks)
  • scripts/tabularize_ts.py (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/tabularize.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • .pre-commit-config.yaml
Files skipped from review as they are similar to previous changes (3)
  • src/MEDS_tabular_automl/generate_static_features.py
  • src/MEDS_tabular_automl/generate_ts_features.py
  • src/MEDS_tabular_automl/tabularize.py
Additional Context Used

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Additional comments not posted (3)
scripts/tabularize_ts.py (1)

8-51: Well-documented function setup for tabularize_ts_data.

The function is well-documented with clear descriptions of its purpose and parameters. The use of hydra.main for configuration and setup_environment for environment setup is appropriate.

scripts/tabularize_static.py (1)

19-125: Well-implemented functions for configuration storage and data processing.

Both store_config_yaml and tabularize_static_data are well-documented and handle file operations and data processing effectively. The error handling and configuration management are appropriately implemented.

src/MEDS_tabular_automl/utils.py (1)

21-240: Comprehensive utility functions for data processing and configuration management.

The utility functions in src/MEDS_tabular_automl/utils.py are well-documented and effectively handle data processing, feature column parsing, and configuration management. The error handling and data type management are appropriately implemented.

(Resolved review comments, now outdated: scripts/summarize_over_windows.py, tests/test_tabularize.py, scripts/identify_columns.py)
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between fd1731f and 63b9ba6.
Files selected for processing (3)
  • .github/workflows/tests.yaml (1 hunks)
  • src/MEDS_tabular_automl/tabularize.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • .github/workflows/tests.yaml
Files skipped from review as they are similar to previous changes (2)
  • src/MEDS_tabular_automl/tabularize.py
  • src/MEDS_tabular_automl/utils.py

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 63b9ba6 and 9e9046f.
Files selected for processing (5)
  • scripts/identify_columns.py (1 hunks)
  • scripts/tabularize_ts.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • scripts/identify_columns.py
  • scripts/tabularize_ts.py
  • src/MEDS_tabular_automl/utils.py
Additional Context Used

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Additional comments not posted (3)
src/MEDS_tabular_automl/generate_ts_features.py (2)

Line range hint 87-162: The implementation of get_flat_ts_rep correctly utilizes dynamic features for time series representation. Good use of filtering and summarization.


23-84: Renaming to _summarize_dynamic_measurements indicates internal use. Ensure all references in the codebase are updated accordingly.

Verification successful

The occurrences of the old function name summarize_dynamic_measurements have been successfully updated to _summarize_dynamic_measurements in the src/MEDS_tabular_automl/generate_ts_features.py file.

  • def summarize_dynamic_measurements -> def _summarize_dynamic_measurements
  • Example in the docstring updated.
  • Return statement updated.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify all function calls to `_summarize_dynamic_measurements` match the new name.

# Test: Search for the old function name usage. Expect: No occurrences of the old name.
rg --type python $'summarize_dynamic_measurements'

Length of output: 80



Script:

#!/bin/bash
# Description: Verify all function calls to `_summarize_dynamic_measurements` match the new name.

# Test: Search for the old function name usage. Expect: No occurrences of the old name.
rg --glob '*.py' 'summarize_dynamic_measurements'

Length of output: 371



Script:

#!/bin/bash
# Description: Update occurrences of the old function name to the new function name.

# Update the function definition
sed -i 's/def summarize_dynamic_measurements/def _summarize_dynamic_measurements/' src/MEDS_tabular_automl/generate_ts_features.py

# Update the example in the docstring
sed -i 's/summarize_dynamic_measurements(ts_columns, df)/_summarize_dynamic_measurements(ts_columns, df)/' src/MEDS_tabular_automl/generate_ts_features.py

# Update the return statement
sed -i 's/return summarize_dynamic_measurements/return _summarize_dynamic_measurements/' src/MEDS_tabular_automl/generate_ts_features.py

Length of output: 428

tests/test_tabularize.py (1)

104-162: The test logic in test_tabularize is well-structured and covers the necessary functionality effectively.

…dexed file types:

- Code data: Files with a patient_id and timestamp index, containing columns for each code with binary presence indicators (1 and 0).
- Value data: Files similar in structure but containing the numerical values observed for each code.
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 2

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 9e9046f and cd067f8.
Files selected for processing (5)
  • scripts/identify_columns.py (1 hunks)
  • scripts/tabularize_ts.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • scripts/identify_columns.py
  • scripts/tabularize_ts.py
  • src/MEDS_tabular_automl/utils.py
Additional Context Used

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Additional comments not posted (2)
src/MEDS_tabular_automl/generate_ts_features.py (2)

18-79: The function summarize_dynamic_measurements is well-documented and the example provided is clear and illustrative.


Line range hint 82-162: The function get_flat_ts_rep is well-documented and effectively utilizes the summarize_dynamic_measurements function to process dynamic features.

…n handling

- Introduce VALID_AGGREGATIONS to define permissible aggregations.
- Implement time_aggd_col_alias_fntr to generate dynamic column aliases based on window size and aggregation.
- Extend  for dynamic expression creation based on aggregation type and window size, handling both cumulative and windowed aggregations.
- Enhance _summarize_over_window to apply specified aggregations over defined window sizes, ensuring comprehensive data summarization.
- Update generate_summary to handle multiple dataframes, aggregate data using specified window sizes and aggregations, and ensure inclusion of all specified feature columns, adding missing ones with default values.
- Add extensive doctests to ensure the accuracy of the summarization functions, demonstrating usage with both code and value data types.
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 4

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between cd067f8 and c8ca3bb.
Files selected for processing (5)
  • configs/tabularize.yaml (1 hunks)
  • scripts/summarize_over_windows.py (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
  • configs/tabularize.yaml
  • scripts/summarize_over_windows.py
  • src/MEDS_tabular_automl/generate_static_features.py
  • src/MEDS_tabular_automl/generate_ts_features.py

(4 resolved review comments on src/MEDS_tabular_automl/generate_summarized_reps.py, now outdated)
mmcdermott and others added 3 commits May 27, 2024 09:01
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between c8ca3bb and 720a533.
Files selected for processing (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py

Repository owner deleted a comment from coderabbitai bot May 27, 2024
@mmcdermott (Owner) left a comment

Can you write a brief summary of how these are expected to be used / run in the README?

Also, on all scripts, you need to start with #!/usr/bin/env python and have the if __name__ == "__main__": main() block at the end

@@ -8,27 +8,23 @@ window_sizes: ???
codes: null
aggs:
- "code/count"
- "code/time_since_last"
- "code/time_since_first"
- "code/present"
Owner:

Do we need these in addition to "*/count"? The two should actually be redundant, no?

(4 resolved review comments on scripts/identify_columns.py, two of them outdated)
or simply as present, then performs a pivot to reshape the data for each patient, providing
a tabular format where each row represents a patient and each column represents a static feature.
"""
static_present = [c for c in feature_columns if c.startswith("STATIC_") and c.endswith("present")]
Owner:

Anything with a null timestamp should be assumed to be a static code. They won't all start with "STATIC_" (unless you've done some pre-processing here that I'm missing).

(Resolved review comments, now outdated: src/MEDS_tabular_automl/generate_summarized_reps.py (2), src/MEDS_tabular_automl/generate_ts_features.py, src/MEDS_tabular_automl/tabularize.py)
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 720a533 and d9ba7e7.
Files selected for processing (7)
  • README.md (3 hunks)
  • configs/tabularize.yaml (1 hunks)
  • scripts/identify_columns.py (1 hunks)
  • scripts/summarize_over_windows.py (1 hunks)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • configs/tabularize.yaml
  • scripts/identify_columns.py
  • scripts/summarize_over_windows.py
  • src/MEDS_tabular_automl/generate_summarized_reps.py
  • src/MEDS_tabular_automl/generate_ts_features.py
Additional Context Used
LanguageTool (2)
README.md (2)

Near line 54: It appears that a hyphen is missing in the plural noun “to-dos”?
Context: ...bularize.yaml`. ## AutoML Pipelines # TODOs 1. Leverage the "event bound aggregati...
Rule ID: TO_DO_HYPHEN


Near line 70: Possible verb agreement error. Did you mean “formats”? (Some collective nouns can be treated as both singular and plural, so ‘format’ is not always incorrect.)
Context: ...n - MEDS_cohort_dir: directory of MEDS format dataset that is ingested. - tabularized...
Rule ID: COLLECTIVE_NOUN_VERB_AGREEMENT_VBP

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Markdownlint (5), README.md: line 6: heading levels should only increment by one level at a time (expected h2, actual h4); lines 27, 31, 54, 68: multiple top-level headings in the same document.

Additional comments not posted (6)
README.md (5)

Line range hint 1-25: The project description is clear and effectively sets the stage for understanding the unique challenges and solutions provided by the repository.


28-28: Installation instructions are concise and clear.


32-43: The usage section effectively communicates the core functionalities of the repository.


55-67: The TODOs are well-defined and provide clear directions for future enhancements.


68-90: The configuration arguments are thoroughly documented, enhancing the usability of the repository.

tests/test_tabularize.py (1)

104-179: The test function is well-structured and covers a comprehensive range of scenarios for testing the tabularization process.

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 5

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c225c47 and cb21821.

Files selected for processing (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Additional context used

Ruff, src/MEDS_tabular_automl/generate_summarized_reps.py: Module level import not at top of file (lines 6-9, 11-12).

Additional comments not posted (3)
src/MEDS_tabular_automl/generate_summarized_reps.py (3)

112-126: Add input validation for window_size.


83-109: Add input validation for window_size.

+ if not isinstance(window_size, str):
+     raise TypeError("window_size must be a string")

Likely invalid or redundant comment.


44-80: Ensure timestamps are sorted before grouping.

- # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
+ df = df.sort_values(by='timestamp')

Likely invalid or redundant comment.

Comment on lines 165 to 229
def compute_agg(index_df, matrix: sparray, window_size: str, agg: str, num_features: int, use_tqdm=False):
    """Applies aggregation to the dataframe.

    The dataframe is expected to have only the relevant columns for aggregating.
    It should have the patient_id and timestamp columns, and then only code columns
    if agg is a code aggregation, or only value columns if it is a value aggregation.

    Example:
    >>> from MEDS_tabular_automl.generate_ts_features import get_flat_ts_rep
    >>> feature_columns = ['A/value/sum', 'A/code/count', 'B/value/sum', 'B/code/count',
    ...     "C/value/sum", "C/code/count", "A/static/present"]
    >>> data = {'patient_id': [1, 1, 1, 2, 2, 2],
    ...     'code': ['A', 'A', 'B', 'B', 'C', 'C'],
    ...     'timestamp': ['2021-01-01', '2021-01-01', '2020-01-01', '2021-01-04', None, None],
    ...     'numerical_value': [1, 2, 2, 2, 3, 4]}
    >>> df = pl.DataFrame(data).lazy()
    >>> df = get_flat_ts_rep(feature_columns, df)
    >>> df
       patient_id   timestamp  A/value  B/value  C/value  A/code  B/code  C/code
    0           1  2021-01-01        1        0        0       1       0       0
    1           1  2021-01-01        2        0        0       1       0       0
    2           1  2020-01-01        0        2        0       0       1       0
    3           2  2021-01-04        0        2        0       0       1       0
    >>> df['timestamp'] = pd.to_datetime(df['timestamp'])
    >>> df.dtypes
    patient_id                int64
    timestamp        datetime64[ns]
    A/value        Sparse[int64, 0]
    B/value        Sparse[int64, 0]
    C/value        Sparse[int64, 0]
    A/code         Sparse[int64, 0]
    B/code         Sparse[int64, 0]
    C/code         Sparse[int64, 0]
    dtype: object
    >>> output = compute_agg(df[['patient_id', 'timestamp', 'A/code', 'B/code', 'C/code']],
    ...     "1d", "code/count")
    >>> output
       1d/A/code/count  1d/B/code/count  1d/C/code/count   timestamp  patient_id
    0                1                0                0  2021-01-01           1
    1                2                0                0  2021-01-01           1
    2                0                1                0  2020-01-01           1
    0                0                1                0  2021-01-04           2
    >>> output.dtypes
    1d/A/code/count    Sparse[int64, 0]
    1d/B/code/count    Sparse[int64, 0]
    1d/C/code/count    Sparse[int64, 0]
    timestamp            datetime64[ns]
    patient_id                    int64
    dtype: object
    """
    group_df = (
        index_df.with_row_index("index")
        .group_by(["patient_id", "timestamp"], maintain_order=True)
        .agg([pl.col("index").min().alias("min_index"), pl.col("index").max().alias("max_index")])
        .collect()
    )
    index_df = group_df.lazy().select(pl.col("patient_id", "timestamp"))
    windows = group_df.select(pl.col("min_index", "max_index"))
    logger.info("Step 1.5: Running sparse aggregation.")
    matrix = aggregate_matrix(windows, matrix, agg, num_features, use_tqdm)
    logger.info("Step 2: computing rolling windows and aggregating.")
    windows = get_rolling_window_indicies(index_df, window_size)
    logger.info("Starting final sparse aggregations.")
    matrix = aggregate_matrix(windows, matrix, agg, num_features, use_tqdm)
    return matrix
Contributor:

Simplify the function or add comments to improve readability.

The function compute_agg performs multiple complex operations. Consider breaking it down into smaller functions or adding detailed comments to explain each step.

Comment on lines +283 to +337
def generate_summary(
    feature_columns: list[str], index_df: pl.LazyFrame, matrix: sparray, window_size, agg: str, use_tqdm=False
) -> pl.LazyFrame:
    """Generate a summary of the data frame for given window sizes and aggregations.

    This function processes a dataframe to apply specified aggregations over defined window sizes.
    It then joins the resulting frames on 'patient_id' and 'timestamp', and ensures all specified
    feature columns exist in the final output, adding missing ones with default values.

    Args:
        feature_columns (list[str]): List of all feature columns that must exist in the final output.
        df (list[pl.LazyFrame]): The input dataframes to process, expected to be a length-2 list with code_df
            (pivoted shard with binary presence of codes) and value_df (pivoted shard with numerical values
            for each code).
        window_sizes (list[str]): List of window sizes to apply for summarization.
        aggregations (list[str]): List of aggregations to perform within each window size.

    Returns:
        pl.LazyFrame: A LazyFrame containing the summarized data with all required features present.

    Expect:
    >>> from datetime import date
    >>> wide_df = pd.DataFrame({"patient_id": [1, 1, 1, 2],
    ...     "A/code": [1, 1, 0, 0],
    ...     "B/code": [0, 0, 1, 1],
    ...     "A/value": [1, 2, 3, None],
    ...     "B/value": [None, None, None, 4.0],
    ...     "timestamp": [date(2021, 1, 1), date(2021, 1, 1), date(2020, 1, 3), date(2021, 1, 4)],
    ...     })
    >>> wide_df['timestamp'] = pd.to_datetime(wide_df['timestamp'])
    >>> for col in ["A/code", "B/code", "A/value", "B/value"]:
    ...     wide_df[col] = pd.arrays.SparseArray(wide_df[col])
    >>> feature_columns = ["A/code/count", "B/code/count", "A/value/sum", "B/value/sum"]
    >>> aggregations = ["code/count", "value/sum"]
    >>> window_sizes = ["full", "1d"]
    >>> generate_summary(feature_columns, wide_df, window_sizes, aggregations)[
    ...     ["1d/A/code/count", "full/B/code/count", "full/B/value/sum"]]
       1d/A/code/count  full/B/code/count  full/B/value/sum
    0              NaN                1.0                 0
    1              NaN                1.0                 0
    2              NaN                1.0                 0
    0              NaN                1.0                 0
    0              NaN                NaN                 0
    1              NaN                NaN                 0
    2              NaN                NaN                 0
    0              NaN                NaN                 0
    0                0                NaN                 0
    1              1.0                NaN                 0
    2              2.0                NaN                 0
    0                0                NaN                 0
    0              NaN                NaN                 0
    1              NaN                NaN                 0
    2              NaN                NaN                 0
    0              NaN                NaN                 0
    """
Contributor:

Ensure all specified feature columns exist in the final output, adding missing ones with default values.

+ # Ensure all specified feature columns are present in the final output
+ missing_cols = [col for col in feature_columns if col not in df.columns]
+ if missing_cols:
+     for col in missing_cols:
+         df[col] = pd.arrays.SparseArray([0]*len(df), fill_value=0)

Committable suggestion was skipped due to low confidence.

Comment on lines +28 to +41
def sparse_aggregate(sparse_matrix, agg):
    if agg == "sum":
        merged_matrix = sparse_matrix.sum(axis=0, dtype=sparse_matrix.dtype)
    elif agg == "min":
        merged_matrix = sparse_matrix.min(axis=0)
    elif agg == "max":
        merged_matrix = sparse_matrix.max(axis=0)
    elif agg == "sum_sqd":
        merged_matrix = sparse_matrix.power(2).sum(axis=0, dtype=sparse_matrix.dtype)
    elif agg == "count":
        merged_matrix = sparse_matrix.getnnz(axis=0)
    else:
        raise ValueError(f"Aggregation method '{agg}' not implemented.")
    return merged_matrix
Contributor:

Enhance error handling by providing a list of valid aggregation methods in the error message.

- raise ValueError(f"Aggregation method '{agg}' not implemented.")
+ raise ValueError(f"Aggregation method '{agg}' not implemented. Valid methods are: sum, min, max, sum_sqd, count.")

Comment on lines +15 to +25
def time_aggd_col_alias_fntr(window_size: str, agg: str) -> Callable[[str], str]:
    if agg is None:
        raise ValueError("Aggregation type 'agg' must be provided")

    def f(c: str) -> str:
        if c in ["patient_id", "timestamp"]:
            return c
        else:
            return "/".join([window_size] + c.split("/") + [agg])

    return f
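For reference, a quick illustration of what this alias factory produces, derived directly from the code above:

alias = time_aggd_col_alias_fntr("1d", "count")
alias("A/code")      # -> "1d/A/code/count"
alias("patient_id")  # -> "patient_id" (index columns pass through unchanged)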
Contributor:

Ensure proper error handling by replacing assertions with exceptions.

- if agg is None:
-     raise ValueError("Aggregation type 'agg' must be provided")
+ if not isinstance(agg, str) or not agg:
+     raise ValueError("Aggregation type 'agg' must be provided and must be a non-empty string")

Comment on lines +1 to +12
from collections.abc import Callable

import pandas as pd

pd.set_option("compute.use_numba", True)
import numpy as np
import polars as pl
from loguru import logger
from scipy.sparse import coo_array, csr_array, sparray, vstack

from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
Contributor:

Reorganize module-level imports to the top of the file.

+ import numpy as np
+ import pandas as pd
+ import polars as pl
+ from collections.abc import Callable
+ from loguru import logger
+ from scipy.sparse import coo_array, csr_array, sparray, vstack
+ from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
+ from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
- from collections.abc import Callable
- import pandas as pd
- pd.set_option("compute.use_numba", True)
- import numpy as np
- import polars as pl
- from loguru import logger
- from scipy.sparse import coo_array, csr_array, sparray, vstack
- from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
- from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm

Committable suggestion was skipped due to low confidence.

Tools

Ruff: Module level import not at top of file (lines 6-9, 11-12).

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cb21821 and 3a412a0.

Files selected for processing (1)
  • scripts/summarize_over_windows.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • scripts/summarize_over_windows.py

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 6

Outside diff range and nitpick comments (2)
hf_cohort/aces_task_extraction.py (1)

36-76: Consider enhancing logging for better traceability of the data processing workflow.

Adding more detailed logging statements can help in debugging and understanding the flow of data, especially in a complex data processing pipeline like this one.

scripts/xgboost_sweep.py (1)

276-284: Clarify the choice between in-memory and iterator-based data structures in the _build method.

Consider adding comments to explain why one might choose to build data structures in memory versus using iterators, as this could help future maintainers understand the design decisions.
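As a sketch of the two options in XGBoost terms, the example below contrasts in-memory construction with iterator-based streaming. It assumes XGBoost's DataIter/QuantileDMatrix API (available in recent XGBoost releases) and is not the PR's exact Iterator class.

import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)

# Option 1: in-memory. Simple, but holds the full dataset at once.
dtrain = xgb.DMatrix(X, label=y)

# Option 2: iterator-based. Streams shards, keeping peak memory low.
class ShardIter(xgb.DataIter):
    def __init__(self, shards):
        self._shards, self._i = shards, 0
        super().__init__()

    def next(self, input_data):
        if self._i == len(self._shards):
            return 0  # no more shards
        Xs, ys = self._shards[self._i]
        input_data(data=Xs, label=ys)
        self._i += 1
        return 1

    def reset(self):
        self._i = 0

it = ShardIter([(X[:50], y[:50]), (X[50:], y[50:])])
dtrain_streamed = xgb.QuantileDMatrix(it)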

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3a412a0 and a4f1843.

Files selected for processing (8)
  • configs/xgboost_sweep.yaml (1 hunks)
  • hf_cohort/aces_task_extraction.py (1 hunks)
  • scripts/tabularize_static.py (1 hunks)
  • scripts/xgboost_sweep.py (1 hunks)
  • src/MEDS_tabular_automl/file_name.py (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • configs/xgboost_sweep.yaml
  • scripts/tabularize_static.py
  • src/MEDS_tabular_automl/file_name.py
  • src/MEDS_tabular_automl/generate_static_features.py
  • src/MEDS_tabular_automl/utils.py
Additional context used

Ruff, tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-24).

Additional comments not posted (4)
hf_cohort/aces_task_extraction.py (1)

13-20: LGTM! The function get_events_df is well-implemented with clear logic for filtering and processing data frames.

scripts/xgboost_sweep.py (3)

21-48: Refactor Iterator class to improve readability and maintainability.

The Iterator class is quite large and handles multiple responsibilities. Consider breaking it down into smaller classes or functions, each handling a specific part of the data processing, such as loading data, filtering, and iteration management.


270-271: Add error handling and validation checks in train and evaluate methods.

+ # TODO: add in eval, early stopping, etc.
+ # TODO: check for Nan and inf in labels!

These TODOs are crucial for robust model training. Implementing early stopping, evaluation during training, and checks for NaN or infinite values in labels will help prevent overfitting and ensure the model's reliability.
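A minimal sketch of the label check this TODO calls for, assumed to run just before DMatrix construction (numpy only; not the PR's code):

import numpy as np

def check_labels(labels: np.ndarray) -> None:
    """Fail fast if labels contain NaN or infinite values before training."""
    if not np.all(np.isfinite(labels)):
        bad = np.flatnonzero(~np.isfinite(labels))
        raise ValueError(f"Labels contain NaN/inf at indices {bad[:10]}")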


319-346: Ensure consistent logging and error handling in the xgboost function.

Consider adding more detailed logging at each step of the model's configuration, training, and evaluation to help trace the execution flow and identify potential issues. Also, include error handling to manage exceptions that may occur during the model's operation.

Comment on lines +23 to +33
def get_unique_time_events_df(events_df: pl.DataFrame):
    """Updates Events DataFrame to have unique timestamps and sorted by patient_id and timestamp."""
    assert events_df.select(pl.col("timestamp")).null_count().collect().item() == 0
    # Check events_df is sorted - so it aligns with the ts_matrix we generate later in the pipeline
    events_df = (
        events_df.drop_nulls("timestamp")
        .select(pl.col(["patient_id", "timestamp"]))
        .unique(maintain_order=True)
    )
    assert events_df.sort(by=["patient_id", "timestamp"]).collect().equals(events_df.collect())
    return events_df
Contributor:

Consider replacing assertions with explicit error handling for production environments.

Assertions are generally disabled in production Python environments (using -O flag), which could lead to unhandled errors. Consider using conditional checks and raising exceptions instead.
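A sketch of the suggested change, replacing the first assertion above with an explicit exception (reusing the names from the snippet above; this mirrors the reviewer's suggestion, not code from the PR):

null_ts = events_df.select(pl.col("timestamp")).null_count().collect().item()
if null_ts != 0:
    raise ValueError(f"events_df contains {null_ts} null timestamps; expected none")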

Comment on lines 194 to 322

# Store MEDS outputs
for split, data in MEDS_OUTPUTS.items():
    file_path = MEDS_cohort_dir / "final_cohort" / f"{split}.parquet"
    file_path.parent.mkdir(exist_ok=True)
    df = pl.read_csv(StringIO(data))
    df.with_columns(pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S.%f")).write_parquet(
        file_path
    )

# Check the files are not empty
meds_files = f_name_resolver.list_meds_files()
assert len(meds_files) == 4, "MEDS Data Files Should be 4!"
for f in meds_files:
    assert pl.read_parquet(f).shape[0] > 0, "MEDS Data Tabular Dataframe Should not be Empty!"

split_json = json.load(StringIO(SPLITS_JSON))
splits_fp = MEDS_cohort_dir / "splits.json"
json.dump(split_json, splits_fp.open("w"))
logger.info("caching flat representation of MEDS data")
store_columns(cfg)
assert (tabularized_data_dir / "config.yaml").is_file()
assert (tabularized_data_dir / "feature_columns.json").is_file()
assert (tabularized_data_dir / "feature_freqs.json").is_file()

feature_columns = json.load(open(f_name_resolver.get_feature_columns_fp()))
assert get_feature_names("code/count", feature_columns) == sorted(CODE_COLS)
assert get_feature_names("static/present", feature_columns) == sorted(STATIC_PRESENT_COLS)
assert get_feature_names("static/first", feature_columns) == sorted(STATIC_FIRST_COLS)
for value_agg in VALUE_AGGREGATIONS:
    assert get_feature_names(value_agg, feature_columns) == sorted(VALUE_COLS)

# Check Static File Generation
tabularize_static_data(cfg)
actual_files = [str(Path(*f.parts[-5:])) for f in f_name_resolver.list_static_files()]
assert set(actual_files) == set(EXPECTED_STATIC_FILES)
# Check the files are not empty
for f in f_name_resolver.list_static_files():
    static_matrix = load_matrix(f)
    assert static_matrix.shape[0] > 0, "Static Data Tabular Dataframe Should not be Empty!"
    expected_num_cols = len(get_feature_names(f"static/{f.stem}", feature_columns))
    logger.info((static_matrix.shape[1], expected_num_cols))
    logger.info(f_name_resolver.list_static_files())
    assert static_matrix.shape[1] == expected_num_cols, (
        f"Static Data Tabular Dataframe Should have {expected_num_cols}"
        f"Columns but has {static_matrix.shape[1]}!"
    )
static_first_fp = f_name_resolver.get_flat_static_rep("tuning", "0", "static/first")
static_present_fp = f_name_resolver.get_flat_static_rep("tuning", "0", "static/present")
assert (
    load_matrix(static_first_fp).shape[0] == load_matrix(static_present_fp).shape[0]
), "static data first and present aggregations have different numbers of rows"

summarize_ts_data_over_windows(cfg)
# confirm summary files exist:
output_files = f_name_resolver.list_ts_files()
f_name_resolver.list_ts_files()
actual_files = [str(Path(*f.parts[-5:])) for f in output_files]

assert set(actual_files) == set(SUMMARIZE_EXPECTED_FILES)
for f in output_files:
    sparse_array = load_matrix(f)
    assert sparse_array.shape[0] > 0
    assert sparse_array.shape[1] > 0
ts_code_fp = f_name_resolver.get_flat_ts_rep("tuning", "0", "365d", "code/count")
ts_value_fp = f_name_resolver.get_flat_ts_rep("tuning", "0", "365d", "value/sum")
assert (
    load_matrix(ts_code_fp).shape[0] == load_matrix(ts_value_fp).shape[0]
), "time series code and value have different numbers of rows"
assert (
    load_matrix(static_first_fp).shape[0] == load_matrix(ts_value_fp).shape[0]
), "static data and time series have different numbers of rows"

# Create fake labels
for f in f_name_resolver.list_meds_files():
    df = pl.read_parquet(f)
    df = get_events_df(df, feature_columns)
    pseudo_labels = pl.Series(([0, 1] * df.shape[0])[: df.shape[0]])
    df = df.with_columns(pl.Series(name="label", values=pseudo_labels))
    df = df.select(pl.col(["patient_id", "timestamp", "label"]))
    df = df.unique(subset=["patient_id", "timestamp"])
    df = df.with_row_index("event_id")

    split = f.parent.stem
    shard_num = f.stem
    out_f = f_name_resolver.get_label(split, shard_num)
    out_f.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(out_f)
model_dir = Path(d) / "save_model"
xgboost_config_kwargs = {
    "model_dir": str(model_dir.resolve()),
    "hydra.mode": "MULTIRUN",
}
xgboost_config_kwargs = {**tabularize_config_kwargs, **xgboost_config_kwargs}
with initialize(version_base=None, config_path="../configs/"):  # path to config.yaml
    overrides = [f"{k}={v}" for k, v in xgboost_config_kwargs.items()]
    cfg = compose(config_name="xgboost_sweep", overrides=overrides)  # config.yaml
    xgboost(cfg)
output_files = list(model_dir.glob("*/*/*_model.json"))
assert len(output_files) == 1
Contributor:

Refactor test_tabularize into smaller, more focused tests.

This function covers multiple aspects of the data processing workflow, including setup, execution, and verification. Splitting this into smaller tests would improve maintainability and make it easier to identify issues with specific parts of the data processing pipeline.
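One hedged way to act on this suggestion: move the shared setup into a pytest fixture and give each pipeline stage its own focused test. The function names store_columns, tabularize_static_data, and summarize_ts_data_over_windows come from this PR; make_test_config is a hypothetical helper wrapping the Hydra compose step.

import pytest

@pytest.fixture
def tabularized_cohort(tmp_path):
    """Shared setup: write the synthetic MEDS cohort and run column identification."""
    cfg = make_test_config(tmp_path)  # hypothetical helper, not in the PR
    store_columns(cfg)
    return cfg

def test_static_tabularization(tabularized_cohort):
    tabularize_static_data(tabularized_cohort)
    # assertions on the static output files only ...

def test_window_summarization(tabularized_cohort):
    summarize_ts_data_over_windows(tabularized_cohort)
    # assertions on the time-series summary files only ...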

Comment on lines +1 to +24
import rootutils

root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

import json
import tempfile
from io import StringIO
from pathlib import Path

import polars as pl
from hydra import compose, initialize
from loguru import logger

from MEDS_tabular_automl.file_name import FileNameResolver
from MEDS_tabular_automl.utils import (
    VALUE_AGGREGATIONS,
    get_events_df,
    get_feature_names,
    load_matrix,
)
from scripts.identify_columns import store_columns
from scripts.summarize_over_windows import summarize_ts_data_over_windows
from scripts.tabularize_static import tabularize_static_data
from scripts.xgboost_sweep import xgboost
Reorganize import statements to the top of the file to follow Python's best practices.

+ import json
+ import tempfile
+ from io import StringIO
+ from pathlib import Path
+ import polars as pl
+ from hydra import compose, initialize
+ from loguru import logger
+ from MEDS_tabular_automl.file_name import FileNameResolver
+ from MEDS_tabular_automl.utils import VALUE_AGGREGATIONS, get_feature_names, load_matrix
+ from scripts.identify_columns import store_columns
+ from scripts.summarize_over_windows import summarize_ts_data_over_windows
+ from scripts.tabularize_static import tabularize_static_data
- import rootutils
Tools
Ruff

5-8, 10-12, 14-24: Module level import not at top of file (one finding per line in the import block)

Comment on lines +319 to +346
@hydra.main(version_base=None, config_path="../configs", config_name="xgboost_sweep")
def xgboost(cfg: DictConfig) -> float:
    """Optimize the model based on the provided configuration.

    Args:
    - cfg (DictConfig): Configuration dictionary.

    Returns:
    - float: Evaluation result.
    """
    logger.debug("Initializing XGBoost model")
    model = XGBoostModel(cfg)
    logger.debug("Training XGBoost model")
    time = datetime.now()
    model.train()
    logger.debug(f"Training took {datetime.now() - time}")
    # save model
    save_dir = (
        Path(cfg.model_dir)
        / "_".join(map(str, cfg.window_sizes))
        / "_".join([agg.replace("/", "") for agg in cfg.aggs])
    )
    save_dir.mkdir(parents=True, exist_ok=True)

    model.model.save_model(save_dir / f"{np.random.randint(100000, 999999)}_model.json")

    return model.evaluate()

Enhance the xgboost function with detailed logging and error handling.

Consider adding more detailed logging at each step of the model's configuration, training, and evaluation to help trace the execution flow and identify potential issues. Also, include error handling to manage exceptions that may occur during the model's operation.


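A minimal sketch of what that could look like, reusing the names from the snippet above (XGBoostModel, loguru's logger, Hydra's decorator); the try/except structure is illustrative, not part of the PR:

@hydra.main(version_base=None, config_path="../configs", config_name="xgboost_sweep")
def xgboost(cfg: DictConfig) -> float:
    logger.info(f"Window sizes: {cfg.window_sizes}; aggregations: {cfg.aggs}")
    try:
        model = XGBoostModel(cfg)
        logger.info("Model initialized; starting training")
        model.train()
        logger.info("Training finished; evaluating")
        return model.evaluate()
    except Exception:
        # Log the full traceback, then re-raise so the sweep still registers the failure.
        logger.exception("XGBoost run failed")
        raise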

Comment on lines +306 to +317
def evaluate(self) -> float:
    """Evaluate the model on the test set.

    Returns:
    - float: Evaluation metric (mae).
    """
    # TODO: Figure out exactly what we want to do here

    y_pred = self.model.predict(self.dtest)
    y_true = self.dtest.get_label()
    return mean_absolute_error(y_true, y_pred)

Improve robustness of the evaluate method with error handling and detailed logging.

Consider adding try-except blocks to handle potential exceptions during model evaluation. Additionally, include detailed logging to trace the computation of the evaluation metric.
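
For instance, a hedged sketch that keeps the existing MAE metric and loguru logger:

def evaluate(self) -> float:
    """Evaluate the model on the test set, logging and re-raising any failure."""
    try:
        y_pred = self.model.predict(self.dtest)
        y_true = self.dtest.get_label()
        mae = mean_absolute_error(y_true, y_pred)
        logger.info(f"Test MAE over {len(y_true)} rows: {mae:.4f}")
        return mae
    except Exception:
        logger.exception("Model evaluation failed")
        raise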

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Outside diff range and nitpick comments (1)
src/MEDS_tabular_automl/generate_summarized_reps.py (1)

83-109: Consider enhancing the documentation to clarify the operations performed during the rolling window aggregation.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a4f1843 and 820e194.

Files selected for processing (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/generate_summarized_reps.py

6-9, 11-12: Module level import not at top of file (one finding per import line)

Additional comments not posted (8)
src/MEDS_tabular_automl/generate_summarized_reps.py (8)

15-25: LGTM! The function correctly handles the case where agg is None and appropriately transforms column names.


28-41: LGTM! The function handles various aggregation methods and maintains type consistency where needed.


112-124: LGTM! The function correctly calculates indices for rolling windows and handles special cases appropriately.


127-160: LGTM! The function efficiently aggregates the matrix based on window indices and optionally displays progress.


163-227: LGTM! The function correctly applies aggregation to the dataframe and includes comprehensive documentation with examples.


230-278: LGTM! The function validates the aggregation type and processes the data frame to generate a summary.


281-349: LGTM! The function processes the data frame to apply specified aggregations and ensures all required feature columns are present in the final output.


351-375: LGTM! The main block correctly sets up the environment and processes the data using the defined functions.

Comment on lines +44 to +80
def sum_merge_timestamps(df, sparse_matrix, agg):
    """Groups by timestamp and combines rows that are on the same date.

    The combining is done by summing the rows in the sparse matrix that correspond to the same date.

    Args:
        df (DataFrame): The DataFrame with 'timestamp' and 'patient_id'.
        sparse_matrix (csr_matrix): The corresponding sparse matrix with data.
        agg (str): Aggregation method, currently only 'sum' is implemented.

    Returns:
        DataFrame, csr_matrix: Tuple containing the DataFrame with aggregated timestamps and the
            corresponding sparse matrix.
    """
    # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
    # df = df.sort_values(by='timestamp')

    # Group by timestamp and sum the data
    grouped = df.groupby("timestamp")
    indices = grouped.indices

    # Create a new sparse matrix with summed rows per unique timestamp
    patient_id = df["patient_id"].iloc[0]
    timestamps = []
    output_matrix = csr_array((0, sparse_matrix.shape[1]), dtype=sparse_matrix.dtype)

    # Loop through each group and sum
    for timestamp, rows in indices.items():
        # Combine the rows in the sparse matrix for the current group (respecting the aggregation being used)
        merged_matrix = sparse_aggregate(sparse_matrix[rows], agg)
        # Save the non-zero elements
        output_matrix = vstack([output_matrix, merged_matrix])
        timestamps.append(timestamp)

    # Create output DataFrame
    out_df = pd.DataFrame({"patient_id": [patient_id] * len(timestamps), "timestamp": timestamps})
    return out_df, output_matrix
Consider ensuring timestamps are sorted within the function to avoid potential issues if the assumption is not met externally.

- # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
+ df = df.sort_values(by='timestamp')
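
One caveat with sorting inside the function: the rows of sparse_matrix are positionally aligned with df, so sorting the frame alone would desynchronize the two. A minimal sketch that keeps them aligned (same variable names as above):

import numpy as np

# Apply one stable permutation to both structures so row i of
# sparse_matrix still corresponds to row i of df after sorting.
order = np.argsort(df["timestamp"].to_numpy(), kind="stable")
df = df.iloc[order].reset_index(drop=True)
sparse_matrix = sparse_matrix[order]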
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 820e194 and 4b0637a.

Files selected for processing (1)
  • hf_cohort/aces_task_extraction.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • hf_cohort/aces_task_extraction.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4b0637a and 23877ad.

Files selected for processing (1)
  • hf_cohort/hf_cohort_e2e.sh (1 hunks)
Additional context used
Shellcheck
hf_cohort/hf_cohort_e2e.sh

[warning] 4-4: N_RUNS appears unused. Verify use (or export if used externally).


[warning] 5-5: OUTPUT_BASE appears unused. Verify use (or export if used externally).


[warning] 29-29: POLARS_MAX_THREADS appears unused. Verify use (or export if used externally).


[info] 52-52: Double quote to prevent globbing and word splitting.

Additional comments not posted (1)
hf_cohort/hf_cohort_e2e.sh (1)

10-10: Ensure N_PARALLEL_WORKERS is properly validated.

It's important to validate the input for N_PARALLEL_WORKERS to ensure it's a positive integer. This prevents potential runtime errors or unexpected behaviors.

Verification successful

The validation for N_PARALLEL_WORKERS is correctly implemented and ensures it is a positive integer.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that N_PARALLEL_WORKERS is a positive integer.
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
   echo "Error: N_PARALLEL_WORKERS is not a positive integer."
   exit 1
fi

Length of output: 146

echo "build_dataset.sh failed with status $cmd_exit_status."
echo "Stderr from build_dataset.sh (see $LOG_DIR/cmd.stderr):"
tail $LOG_DIR/cmd.stderr
exit $cmd_exit_status
Correct the shell syntax for robustness.

Quote the variable passed to exit so the status value is passed through without globbing or word splitting.

- exit $cmd_exit_status
+ exit "$cmd_exit_status"
Tools
Shellcheck

[info] 52-52: Double quote to prevent globbing and word splitting.

Comment on lines +16 to +21
echo "Running identify_columns.py: Caching feature names and frequencies."
rm -rf $OUTPUT_DIR
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
MEDS_cohort_dir=$MEDS_DIR \
tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"
Consider adding error handling for the rm and python commands.

The script lacks error handling for critical operations like removing directories and running Python scripts. Adding error checks will improve the robustness of the script.

+ if ! rm -rf $OUTPUT_DIR; then
+    echo "Failed to remove $OUTPUT_DIR"
+    exit 1
+ fi
+ if ! POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
+    MEDS_cohort_dir=$MEDS_DIR \
+    tabularized_data_dir=$OUTPUT_DIR \
+    min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"; then
+    echo "Python script failed"
+    exit 1
+ fi

Comment on lines +11 to +12
WINDOW_SIZES="window_sizes=[1d,7d,30d,365d,full]"
AGGS="aggs=[static/present,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"
Tip

Codebase Verification

The variables WINDOW_SIZES and AGGS are defined in the script hf_cohort/hf_cohort_e2e.sh, but they appear to be empty when the script is executed. This suggests that the variables might not be properly exported or sourced before the format check.

  • Ensure that WINDOW_SIZES and AGGS are correctly set and exported before the format check in the script.
Analysis chain

Review the aggregation and window size parameters.

The parameters for window sizes and aggregations are critical for the script's operation. Ensure these are correctly formatted and match the expected patterns used in downstream processing.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the format of WINDOW_SIZES and AGGS.
if ! [[ "$WINDOW_SIZES" =~ ^window_sizes=\[.*\]$ ]] ; then
   echo "Error: WINDOW_SIZES format is incorrect."
   exit 1
fi
if ! [[ "$AGGS" =~ ^aggs=\[.*\]$ ]] ; then
   echo "Error: AGGS format is incorrect."
   exit 1
fi

Length of output: 134



Script:

#!/bin/bash
# Description: Print the values of WINDOW_SIZES and AGGS to verify their correctness.
echo "WINDOW_SIZES: $WINDOW_SIZES"
echo "AGGS: $AGGS"

Length of output: 61



Script:

#!/bin/bash
# Description: Search for the assignments of WINDOW_SIZES and AGGS in the codebase.
rg 'WINDOW_SIZES='
rg 'AGGS='

Length of output: 560

tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"

POLARS_MAX_THREADS=1
Tip

Codebase Verification

The variable POLARS_MAX_THREADS is set to 1 in hf_cohort/hf_cohort_e2e.sh but is not used afterward in the script. It is used with a value of 32 in other parts of the same script and in hf_cohort/hf_cohort_shard.sh.

  • hf_cohort/hf_cohort_e2e.sh: POLARS_MAX_THREADS=32 is used with Python scripts.
  • hf_cohort/hf_cohort_shard.sh: POLARS_MAX_THREADS=32 is used with Python scripts.

However, the specific instance of POLARS_MAX_THREADS=1 does not appear to be utilized.

Analysis chain

Verify the usage of POLARS_MAX_THREADS.

The variable POLARS_MAX_THREADS is set to 1 without subsequent usage, which might be an oversight or error. Confirm its intended use.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Check if POLARS_MAX_THREADS is used after being set.
grep -q 'POLARS_MAX_THREADS' hf_cohort/hf_cohort_e2e.sh
if [ $? -ne 0 ]; then
    echo "POLARS_MAX_THREADS set but not used."
fi

Length of output: 73



Script:

#!/bin/bash
# Description: Check if POLARS_MAX_THREADS is used anywhere in the repository after being set.
grep -r 'POLARS_MAX_THREADS' .

Length of output: 905

Tools
Shellcheck

[warning] 29-29: POLARS_MAX_THREADS appears unused. Verify use (or export if used externally).

@mmcdermott mmcdermott mentioned this pull request Jun 2, 2024
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 23877ad and 36f54a3.

Files selected for processing (1)
  • .github/workflows/tests.yaml (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/tests.yaml

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 36f54a3 and e7a85ba.

Files selected for processing (1)
  • .github/workflows/tests.yaml (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/tests.yaml

…perties, but don't know what the desired behavior is for the error
Some fixes for pre-commit, testing, etc. NOT READY FOR MERGING YET
@Oufattole Oufattole merged commit f7605b5 into main Jun 5, 2024
0 of 2 checks passed
@mmcdermott mmcdermott deleted the esgpt_caching branch June 12, 2024 14:08
@coderabbitai coderabbitai bot mentioned this pull request Nov 6, 2024
Merged