
Converting ESGPT caching to work for MEDS datasets #1

Merged: 75 commits, Jun 5, 2024
Conversation

@Oufattole (Collaborator) commented May 26, 2024

Pull Request Summary

This pull request details the modifications and enhancements made to the data processing and aggregation pipeline. Below is the list of tasks that have been implemented and those that are still in progress.

Completed Tasks

  • Split Subjects into Shards: Subjects have been divided into manageable shards for efficient processing.
  • Store Parameters for the Job: Configuration parameters, including patient partitioning and other settings, are saved to a JSON file. This ensures that the setup for each job, such as the number of patients per sub-shard and the minimum code inclusion frequency, is consistently maintained and easily retrievable (a minimal sketch of this step appears after this list).
  • Feature Column Processing (get_flat_rep_feature_cols): This function processes aggregations and selects which columns receive which aggregations. The process is categorized into two main types:
    • Static:
      1. Codes without numerical values.
      2. Codes with numerical values.
    • Dynamic:
      1. Codes with applied aggregations across all instances.
      2. Numerical values with continuous aggregations.
  • Static Representations (get_flat_static_rep): Produces tabular representations of static data, i.e., features that do not change over time.
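As a rough illustration of the parameter-storage step above, the sketch below persists job parameters as JSON. The helper name store_params and the parameter values are hypothetical; the PR's actual implementation stores the Hydra config (a store_config_yaml function appears later in this review).

import json
from pathlib import Path

def store_params(output_dir: Path, params: dict) -> Path:
    """Persist job parameters as JSON so later pipeline stages can verify a consistent setup."""
    output_dir.mkdir(parents=True, exist_ok=True)
    params_fp = output_dir / "config.json"
    params_fp.write_text(json.dumps(params, indent=2))
    return params_fp

# Illustrative values, not taken from the PR.
store_params(Path("tabularized_data"), {
    "n_patients_per_sub_shard": 100,
    "min_code_inclusion_frequency": 10,
})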

Tasks in Progress

  • Time Series Data Representation (get_flat_ts_rep): This function is intended to iterate through each shard and produce a raw representation of the time series data (see the sketch after this list).
  • Summarize History: This task will apply the necessary aggregations to the historical data for further analysis.
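As a rough illustration of what this flat time-series representation might look like, the sketch below pivots long-format MEDS events into wide per-code presence and value columns. It is a minimal pandas sketch with assumed column names (patient_id, timestamp, code, numerical_value); the PR's actual implementation uses Polars and, later, sparse matrices.

import pandas as pd

# Long-format MEDS-style events (assumed schema).
events = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
    "code": ["A", "B", "B"],
    "numerical_value": [1.0, None, 4.0],
})

# One binary presence column per code ("A/code", "B/code", ...).
codes = pd.get_dummies(events["code"]).astype(int).add_suffix("/code")

# One value column per code, zero where the code was absent or carried no value.
values = (
    events.pivot_table(index=events.index, columns="code", values="numerical_value", aggfunc="first")
    .reindex(events.index)
    .fillna(0)
    .add_suffix("/value")
)

flat_ts = pd.concat([events[["patient_id", "timestamp"]], codes, values], axis=1)
print(flat_ts)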

Summary by CodeRabbit

  • New Features

    • Introduced functions to generate summarized representations of data frames based on specified window sizes and aggregations.
    • Automated tabularization process for medical data, including identifying columns, tabularizing static data, and summarizing data over windows.
  • Chores

    • Updated Python version setup in GitHub workflows from 3.11 to 3.12.

@coderabbitai (Contributor) bot commented May 26, 2024

Warning: rate limit exceeded

@mmcdermott has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 12 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Commits

Files that changed from the base of the PR and between e7a85ba and c8f4144.

Walkthrough

The recent updates encompass a wide range of enhancements and additions to the MEDS tabular automation system. Key changes include the introduction of new configuration settings, the addition of several Python packages, updates to shell scripts for data processing, and new functions for summarizing and generating features from medical datasets. The updates also include improvements to the testing setup and the README documentation, ensuring a comprehensive and functional pipeline for handling and analyzing medical data.

Changes

  • configs/tabularize.yaml: Added new settings, updated existing ones, and removed certain values to improve configuration flexibility.
  • pyproject.toml: Added several Python packages, including scipy, pandas, and numba.
  • scripts/hf_cohort_e2e.sh, scripts/hf_cohort_shard.sh: Updated scripts with new parameters and processing steps for handling medical cohort data.
  • scripts/summarize_over_windows.py: Introduced a function to manage time-series data summarization.
  • src/MEDS_tabular_automl/...: Added and updated multiple functions across several modules for generating static and time-series features, summarizing data, and handling utilities.
  • .github/workflows/tests.yaml: Updated the Python version setup from 3.11 to 3.12.
  • .pre-commit-config.yaml: Added the arguments --in-place and --remove-all-unused-imports to the autoflake hook.
  • README.md: Updated to clarify the repository's focus, provide installation and usage instructions, and list future enhancements.
  • scripts/identify_columns.py: Automated the creation of flat representations of medical datasets, including functions for storing configuration parameters and writing summarized dataset representations.
  • tests/test_tabularize.py: Introduced test cases for tabularizing medical data using Polars and Hydra, including caching flat representations and summarizing data.
  • hf_cohort/...: Added new scripts and configuration files for processing medical cohort data, including task setup, data extraction, and configuration settings.
  • src/MEDS_tabular_automl/file_name.py: Introduced helper functions for managing file names and paths for MEDS tabular AutoML tasks.

Sequence Diagram(s) (Beta)

sequenceDiagram
    participant User
    participant Script
    participant Config
    participant DataProcessor
    participant Summarizer
    participant Storage

    User->>Script: Run hf_cohort_e2e.sh
    Script->>Config: Load configurations
    Script->>DataProcessor: Process medical data
    DataProcessor->>Summarizer: Summarize data over windows
    Summarizer->>Storage: Save summarized data
    Storage-->>User: Data processing complete

Poem

In the land of data, where numbers play,
A rabbit hops in, brightening the day.
With scripts and configs, it weaves a tale,
Of medical insights, where details prevail.
From static to dynamic, features align,
In the world of MEDS, the data shines.
🐇✨📊



@Oufattole requested a review from mmcdermott on May 26, 2024 at 19:54
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 3

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 8f89008 and 19f0f4e.
Files selected for processing (6)
  • configs/tabularize.yaml (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/tabularize.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Additional Context Used

Ruff (8), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14).
Additional comments not posted (14)
configs/tabularize.yaml (2)

23-24: Added new configuration settings for dynamic and numerical value thresholds.

These settings are crucial for defining the behavior of dynamic and numerical value processing in the dataset. Ensure that these values are well-documented and tested to confirm they meet the intended thresholds.


31-31: Enabled the do_update setting.

This change allows the system to update existing data rather than overwriting it, which can be beneficial for incremental updates. Ensure that this behavior is consistent with the system's requirements and that it doesn't lead to data integrity issues.

src/MEDS_tabular_automl/generate_static_features.py (2)

20-84: Added function _summarize_static_measurements to aggregate static measurements.

This function efficiently aggregates static measurements and pivots them into a tabular format. It's well-documented and uses list comprehensions and filtering effectively. Ensure that the feature columns used are correctly prefixed and that the pivot operation is optimized as suggested in the TODO comment.
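For intuition, here is a minimal sketch of the pivot step described above, using pandas with assumed column names (patient_id, code); the PR's _summarize_static_measurements operates on Polars frames and prefixed feature columns.

import pandas as pd

# Static measurements in long format (assumed schema).
static = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "code": ["SEX//female", "EYE_COLOR//brown", "SEX//male"],
})

# Pivot so each row is a patient and each column a binary static feature.
static_rep = (
    static.assign(present=1)
    .pivot_table(index="patient_id", columns="code", values="present", aggfunc="max", fill_value=0)
    .add_suffix("/present")
)
print(static_rep)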


87-112: Added function get_flat_static_rep to produce a tabular representation of static data features.

This function wraps the _summarize_static_measurements function and normalizes the data, which is crucial for ensuring data consistency before further processing. The function is well-structured and documented. Ensure that the normalization process handles all edge cases appropriately.

tests/test_tabularize.py (1)

101-138: Added a comprehensive test function test_tabularize for the tabularization process.

This test function is well-structured and covers the creation of directories, storing of data, and the tabularization process. It uses temporary directories to avoid side effects, which is a good practice. Ensure that all edge cases and possible errors during the tabularization process are handled and tested.

src/MEDS_tabular_automl/generate_ts_features.py (3)

19-131: Added function _summarize_dynamic_measurements to generate dynamic measurements.

This function handles the inclusion of subjects dynamically and aggregates measurements based on configuration settings. It's well-documented and uses advanced Polars operations effectively. Ensure that the configuration settings (measurement_configs) are robust and cover all necessary scenarios.


134-214: Added function _summarize_over_window to apply aggregations over a specified window.

This function provides flexibility in handling different window sizes and applies various aggregations based on the window size. It's a complex function that handles many edge cases, and it's well-documented. Ensure that the performance of this function is monitored, especially with large datasets.
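As a sketch of window-based aggregation of this kind (pandas shown for brevity, with assumed column names; the PR uses Polars and, later, sparse matrices):

import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-03", "2021-01-04"]),
    "A/code": [1, 1, 0, 0],
    "A/value": [1.0, 2.0, 0.0, 0.0],
})

# For each patient, aggregate over a trailing 1-day window ending at each event.
rolled = (
    df.set_index("timestamp")
    .groupby("patient_id")[["A/code", "A/value"]]
    .rolling("1d")
    .sum()
    .rename(columns=lambda c: f"1d/{c}/sum")
)
print(rolled)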


217-229: Added function get_flat_ts_rep to produce a raw representation for dynamic data.

This function integrates the dynamic measurements summarization and normalization into a single workflow. It's concise and leverages lazy evaluation effectively. Ensure that the sort and collect operations are necessary and do not introduce performance bottlenecks.

src/MEDS_tabular_automl/utils.py (6)

27-27: Defined DF_T as a new data type for internal dataframes.

Defining a specific type for dataframes (DF_T) improves code readability and maintainability by making the expected data type clear throughout the codebase.


38-51: Added function write_df to write dataframes to disk.

This function provides a robust way to write dataframes to disk with an option to overwrite existing files. It handles file paths and directories effectively and uses the Polars library for writing data. Ensure that error handling is robust, especially for file writing operations.
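A minimal sketch of such a writer, assuming Polars and a do_overwrite flag mirroring the description above (illustrative, not the PR's exact signature):

from pathlib import Path

import polars as pl

def write_df(df: pl.DataFrame, fp: Path, do_overwrite: bool = False) -> None:
    """Write a dataframe to disk as Parquet, optionally overwriting an existing file."""
    if fp.exists() and not do_overwrite:
        raise FileExistsError(f"{fp} exists and do_overwrite is False")
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(fp)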


53-87: Added function get_smallest_valid_uint_type to determine the smallest valid unsigned integer type.

This function is crucial for optimizing memory usage by selecting the smallest possible data type that can still represent the data adequately. It's well-documented and includes comprehensive examples. Ensure that the function is used consistently throughout the codebase to maintain data type optimizations.
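A sketch of the idea behind get_smallest_valid_uint_type (assumed behavior: pick the narrowest unsigned integer dtype that can hold a given maximum value; the name smallest_uint_dtype is illustrative):

import numpy as np

def smallest_uint_dtype(max_value: int) -> np.dtype:
    """Return the narrowest unsigned integer dtype that can represent max_value."""
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError(f"{max_value} exceeds the uint64 range")

assert smallest_uint_dtype(255) == np.uint8
assert smallest_uint_dtype(70_000) == np.uint32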


108-149: Enhanced _normalize_flat_rep_df_cols to handle data normalization and type setting.

This function ensures that all necessary columns are added to a dataframe and that each column is set to the correct data type. It also handles zero counts, which can be important for certain types of data analysis. Ensure that this function is tested thoroughly to handle all edge cases.


152-213: Added evaluate_code_properties to categorize codes based on their properties.

This function categorizes codes dynamically based on configuration settings and is a key component of the feature generation process. It's well-documented and handles various edge cases. Ensure that the configuration settings are accurate and that the function's output is validated against expected results.


216-269: Added get_code_column and get_flat_rep_feature_cols to generate feature column names.

These functions are essential for generating feature columns based on the code properties. They are well-documented and use the CodeType enum effectively to ensure that the feature columns are generated correctly. Ensure that these functions are integrated properly with the rest of the feature generation pipeline.
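A rough sketch of how per-code feature column names of the kind used throughout this PR can be generated. The helper feature_columns_for is hypothetical; the real get_flat_rep_feature_cols is driven by code properties and the CodeType enum, while CODE_AGGREGATIONS and VALUE_AGGREGATIONS are names that appear in the PR's utils module.

CODE_AGGREGATIONS = ["code/count"]
VALUE_AGGREGATIONS = ["value/sum"]

def feature_columns_for(codes_with_values: dict[str, bool]) -> list[str]:
    """Build '<code>/<agg>' column names; value aggregations only for codes with numbers."""
    cols = []
    for code, has_value in codes_with_values.items():
        cols.extend(f"{code}/{agg}" for agg in CODE_AGGREGATIONS)
        if has_value:
            cols.extend(f"{code}/{agg}" for agg in VALUE_AGGREGATIONS)
    return sorted(cols)

print(feature_columns_for({"A": True, "B": False}))
# ['A/code/count', 'A/value/sum', 'B/code/count']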

(3 resolved review comments on src/MEDS_tabular_automl/tabularize.py, now outdated)
- Create individual scripts for:
  1. Retrieving column names
  2. Generating static representations
  3. Generating dynamic representations
  4. Summarizing data over windows
- Add doctests for retrieving column names and generating static representations
- Confirm functionality of the above tests
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 3

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 19f0f4e and fd1731f.
Files selected for processing (10)
  • .pre-commit-config.yaml (1 hunks)
  • scripts/identify_columns.py (1 hunks)
  • scripts/summarize_over_windows.py (1 hunks)
  • scripts/tabularize_static.py (1 hunks)
  • scripts/tabularize_ts.py (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/tabularize.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • .pre-commit-config.yaml
Files skipped from review as they are similar to previous changes (3)
  • src/MEDS_tabular_automl/generate_static_features.py
  • src/MEDS_tabular_automl/generate_ts_features.py
  • src/MEDS_tabular_automl/tabularize.py
Additional Context Used

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Additional comments not posted (3)
scripts/tabularize_ts.py (1)

8-51: Well-documented function setup for tabularize_ts_data.

The function is well-documented with clear descriptions of its purpose and parameters. The use of hydra.main for configuration and setup_environment for environment setup is appropriate.

scripts/tabularize_static.py (1)

19-125: Well-implemented functions for configuration storage and data processing.

Both store_config_yaml and tabularize_static_data are well-documented and handle file operations and data processing effectively. The error handling and configuration management are appropriately implemented.

src/MEDS_tabular_automl/utils.py (1)

21-240: Comprehensive utility functions for data processing and configuration management.

The utility functions in src/MEDS_tabular_automl/utils.py are well-documented and effectively handle data processing, feature column parsing, and configuration management. The error handling and data type management are appropriately implemented.

(Resolved review comments, now outdated: scripts/summarize_over_windows.py, tests/test_tabularize.py, scripts/identify_columns.py)
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between fd1731f and 63b9ba6.
Files selected for processing (3)
  • .github/workflows/tests.yaml (1 hunks)
  • src/MEDS_tabular_automl/tabularize.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • .github/workflows/tests.yaml
Files skipped from review as they are similar to previous changes (2)
  • src/MEDS_tabular_automl/tabularize.py
  • src/MEDS_tabular_automl/utils.py

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 63b9ba6 and 9e9046f.
Files selected for processing (5)
  • scripts/identify_columns.py (1 hunks)
  • scripts/tabularize_ts.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • scripts/identify_columns.py
  • scripts/tabularize_ts.py
  • src/MEDS_tabular_automl/utils.py
Additional Context Used

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Additional comments not posted (3)
src/MEDS_tabular_automl/generate_ts_features.py (2)

Line range hint 87-162: The implementation of get_flat_ts_rep correctly utilizes dynamic features for time series representation. Good use of filtering and summarization.


23-84: Renaming to _summarize_dynamic_measurements indicates internal use. Ensure all references in the codebase are updated accordingly.

Verification successful

The occurrences of the old function name summarize_dynamic_measurements have been successfully updated to _summarize_dynamic_measurements in the src/MEDS_tabular_automl/generate_ts_features.py file.

  • def summarize_dynamic_measurements -> def _summarize_dynamic_measurements
  • Example in the docstring updated.
  • Return statement updated.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify all function calls to `_summarize_dynamic_measurements` match the new name.

# Test: Search for the old function name usage. Expect: No occurrences of the old name.
rg --type python $'summarize_dynamic_measurements'

Length of output: 80



Script:

#!/bin/bash
# Description: Verify all function calls to `_summarize_dynamic_measurements` match the new name.

# Test: Search for the old function name usage. Expect: No occurrences of the old name.
rg --glob '*.py' 'summarize_dynamic_measurements'

Length of output: 371



Script:

#!/bin/bash
# Description: Update occurrences of the old function name to the new function name.

# Update the function definition
sed -i 's/def summarize_dynamic_measurements/def _summarize_dynamic_measurements/' src/MEDS_tabular_automl/generate_ts_features.py

# Update the example in the docstring
sed -i 's/summarize_dynamic_measurements(ts_columns, df)/_summarize_dynamic_measurements(ts_columns, df)/' src/MEDS_tabular_automl/generate_ts_features.py

# Update the return statement
sed -i 's/return summarize_dynamic_measurements/return _summarize_dynamic_measurements/' src/MEDS_tabular_automl/generate_ts_features.py

Length of output: 428

tests/test_tabularize.py (1)

104-162: The test logic in test_tabularize is well-structured and covers the necessary functionality effectively.

…dexed file types:

- Code data: Files with a patient_id and timestamp index, containing columns for each code with binary presence indicators (1 and 0).
- Value data: Files similar in structure but containing the numerical values observed for each code.
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 2

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 9e9046f and cd067f8.
Files selected for processing (5)
  • scripts/identify_columns.py (1 hunks)
  • scripts/tabularize_ts.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • scripts/identify_columns.py
  • scripts/tabularize_ts.py
  • src/MEDS_tabular_automl/utils.py
Additional Context Used

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Additional comments not posted (2)
src/MEDS_tabular_automl/generate_ts_features.py (2)

18-79: The function summarize_dynamic_measurements is well-documented and the example provided is clear and illustrative.


Line range hint 82-162: The function get_flat_ts_rep is well-documented and effectively utilizes the summarize_dynamic_measurements function to process dynamic features.

…n handling

- Introduce VALID_AGGREGATIONS to define permissible aggregations.
- Implement time_aggd_col_alias_fntr to generate dynamic column aliases based on window size and aggregation.
- Extend  for dynamic expression creation based on aggregation type and window size, handling both cumulative and windowed aggregations.
- Enhance _summarize_over_window to apply specified aggregations over defined window sizes, ensuring comprehensive data summarization.
- Update generate_summary to handle multiple dataframes, aggregate data using specified window sizes and aggregations, and ensure inclusion of all specified feature columns, adding missing ones with default values.
- Add extensive doctests to ensure the accuracy of the summarization functions, demonstrating usage with both code and value data types.
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 4

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between cd067f8 and c8ca3bb.
Files selected for processing (5)
  • configs/tabularize.yaml (1 hunks)
  • scripts/summarize_over_windows.py (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
  • configs/tabularize.yaml
  • scripts/summarize_over_windows.py
  • src/MEDS_tabular_automl/generate_static_features.py
  • src/MEDS_tabular_automl/generate_ts_features.py

(4 resolved review comments on src/MEDS_tabular_automl/generate_summarized_reps.py, now outdated)
mmcdermott and others added 3 commits May 27, 2024 09:01
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between c8ca3bb and 720a533.
Files selected for processing (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py

Repository owner deleted a comment from coderabbitai bot May 27, 2024
@mmcdermott (Owner) left a comment

Can you write a brief summary of how these are expected to be used / run in the README?

Also, on all scripts, you need to start with #!/usr/bin/env python and have the if __name__ == "__main__": main() block at the end

@@ -8,27 +8,23 @@ window_sizes: ???
codes: null
aggs:
- "code/count"
- "code/time_since_last"
- "code/time_since_first"
- "code/present"
Owner:

Do we need these in addition to "*/count"? The two should actually be redundant, no?

(4 resolved review comments on scripts/identify_columns.py, two of them outdated)
or simply as present, then performs a pivot to reshape the data for each patient, providing
a tabular format where each row represents a patient and each column represents a static feature.
"""
static_present = [c for c in feature_columns if c.startswith("STATIC_") and c.endswith("present")]
Owner:

Anything with a null timestamp should be assumed to be a static code. They won't all start with "STATIC_" (unless you've done some pre-processing here that I'm missing).

(Resolved review comments, now outdated: src/MEDS_tabular_automl/generate_summarized_reps.py (2), src/MEDS_tabular_automl/generate_ts_features.py, src/MEDS_tabular_automl/tabularize.py)
@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR and between 720a533 and d9ba7e7.
Files selected for processing (7)
  • README.md (3 hunks)
  • configs/tabularize.yaml (1 hunks)
  • scripts/identify_columns.py (1 hunks)
  • scripts/summarize_over_windows.py (1 hunks)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • configs/tabularize.yaml
  • scripts/identify_columns.py
  • scripts/summarize_over_windows.py
  • src/MEDS_tabular_automl/generate_summarized_reps.py
  • src/MEDS_tabular_automl/generate_ts_features.py
Additional Context Used
LanguageTool (2)
README.md (2)

Near line 54: It appears that a hyphen is missing in the plural noun “to-dos”?
Context: ...bularize.yaml`. ## AutoML Pipelines # TODOs 1. Leverage the "event bound aggregati...
Rule ID: TO_DO_HYPHEN


Near line 70: Possible verb agreement error. Did you mean “formats”? (Some collective nouns can be treated as both singular and plural, so ‘format’ is not always incorrect.)
Context: ...n - MEDS_cohort_dir: directory of MEDS format dataset that is ingested. - tabularized...
Rule ID: COLLECTIVE_NOUN_VERB_AGREEMENT_VBP

Ruff (11), tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-17).

Markdownlint (5), README.md: line 6: heading levels should only increment by one level at a time (expected h2, actual h4); lines 27, 31, 54, 68: multiple top-level headings in the same document.

Additional comments not posted (6)
README.md (5)

Line range hint 1-25: The project description is clear and effectively sets the stage for understanding the unique challenges and solutions provided by the repository.


28-28: Installation instructions are concise and clear.


32-43: The usage section effectively communicates the core functionalities of the repository.


55-67: The TODOs are well-defined and provide clear directions for future enhancements.


68-90: The configuration arguments are thoroughly documented, enhancing the usability of the repository.

tests/test_tabularize.py (1)

104-179: The test function is well-structured and covers a comprehensive range of scenarios for testing the tabularization process.

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 5

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c225c47 and cb21821.

Files selected for processing (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Additional context used

Ruff, src/MEDS_tabular_automl/generate_summarized_reps.py: Module level import not at top of file (lines 6-9, 11-12).

Additional comments not posted (3)
src/MEDS_tabular_automl/generate_summarized_reps.py (3)

112-126: Add input validation for window_size.


83-109: Add input validation for window_size.

+ if not isinstance(window_size, str):
+     raise TypeError("window_size must be a string")

Likely invalid or redundant comment.


44-80: Ensure timestamps are sorted before grouping.

- # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
+ df = df.sort_values(by='timestamp')

Likely invalid or redundant comment.

Comment on lines 165 to 229
def compute_agg(index_df, matrix: sparray, window_size: str, agg: str, num_features: int, use_tqdm=False):
    """Applies aggregation to the dataframe.

    The dataframe is expected to have only the relevant columns for aggregating.
    It should have the patient_id and timestamp columns, and then only code columns
    if agg is a code aggregation, or only value columns if it is a value aggregation.

    Example:
    >>> from MEDS_tabular_automl.generate_ts_features import get_flat_ts_rep
    >>> feature_columns = ['A/value/sum', 'A/code/count', 'B/value/sum', 'B/code/count',
    ...     "C/value/sum", "C/code/count", "A/static/present"]
    >>> data = {'patient_id': [1, 1, 1, 2, 2, 2],
    ...     'code': ['A', 'A', 'B', 'B', 'C', 'C'],
    ...     'timestamp': ['2021-01-01', '2021-01-01', '2020-01-01', '2021-01-04', None, None],
    ...     'numerical_value': [1, 2, 2, 2, 3, 4]}
    >>> df = pl.DataFrame(data).lazy()
    >>> df = get_flat_ts_rep(feature_columns, df)
    >>> df
       patient_id   timestamp  A/value  B/value  C/value  A/code  B/code  C/code
    0           1  2021-01-01        1        0        0       1       0       0
    1           1  2021-01-01        2        0        0       1       0       0
    2           1  2020-01-01        0        2        0       0       1       0
    3           2  2021-01-04        0        2        0       0       1       0
    >>> df['timestamp'] = pd.to_datetime(df['timestamp'])
    >>> df.dtypes
    patient_id                int64
    timestamp        datetime64[ns]
    A/value        Sparse[int64, 0]
    B/value        Sparse[int64, 0]
    C/value        Sparse[int64, 0]
    A/code         Sparse[int64, 0]
    B/code         Sparse[int64, 0]
    C/code         Sparse[int64, 0]
    dtype: object
    >>> output = compute_agg(df[['patient_id', 'timestamp', 'A/code', 'B/code', 'C/code']],
    ...     "1d", "code/count")
    >>> output
       1d/A/code/count  1d/B/code/count  1d/C/code/count   timestamp  patient_id
    0                1                0                0  2021-01-01           1
    1                2                0                0  2021-01-01           1
    2                0                1                0  2020-01-01           1
    0                0                1                0  2021-01-04           2
    >>> output.dtypes
    1d/A/code/count    Sparse[int64, 0]
    1d/B/code/count    Sparse[int64, 0]
    1d/C/code/count    Sparse[int64, 0]
    timestamp            datetime64[ns]
    patient_id                    int64
    dtype: object
    """
    group_df = (
        index_df.with_row_index("index")
        .group_by(["patient_id", "timestamp"], maintain_order=True)
        .agg([pl.col("index").min().alias("min_index"), pl.col("index").max().alias("max_index")])
        .collect()
    )
    index_df = group_df.lazy().select(pl.col("patient_id", "timestamp"))
    windows = group_df.select(pl.col("min_index", "max_index"))
    logger.info("Step 1.5: Running sparse aggregation.")
    matrix = aggregate_matrix(windows, matrix, agg, num_features, use_tqdm)
    logger.info("Step 2: computing rolling windows and aggregating.")
    windows = get_rolling_window_indicies(index_df, window_size)
    logger.info("Starting final sparse aggregations.")
    matrix = aggregate_matrix(windows, matrix, agg, num_features, use_tqdm)
    return matrix
Contributor:

Simplify the function or add comments to improve readability.

The function compute_agg performs multiple complex operations. Consider breaking it down into smaller functions or adding detailed comments to explain each step.

Comment on lines +283 to +337
def generate_summary(
    feature_columns: list[str], index_df: pl.LazyFrame, matrix: sparray, window_size, agg: str, use_tqdm=False
) -> pl.LazyFrame:
    """Generate a summary of the data frame for given window sizes and aggregations.

    This function processes a dataframe to apply specified aggregations over defined window sizes.
    It then joins the resulting frames on 'patient_id' and 'timestamp', and ensures all specified
    feature columns exist in the final output, adding missing ones with default values.

    Args:
        feature_columns (list[str]): List of all feature columns that must exist in the final output.
        df (list[pl.LazyFrame]): The input dataframes to process, expected to be a length-2 list with code_df
            (pivoted shard with binary presence of codes) and value_df (pivoted shard with numerical values
            for each code).
        window_sizes (list[str]): List of window sizes to apply for summarization.
        aggregations (list[str]): List of aggregations to perform within each window size.

    Returns:
        pl.LazyFrame: A LazyFrame containing the summarized data with all required features present.

    Expect:
    >>> from datetime import date
    >>> wide_df = pd.DataFrame({"patient_id": [1, 1, 1, 2],
    ...     "A/code": [1, 1, 0, 0],
    ...     "B/code": [0, 0, 1, 1],
    ...     "A/value": [1, 2, 3, None],
    ...     "B/value": [None, None, None, 4.0],
    ...     "timestamp": [date(2021, 1, 1), date(2021, 1, 1), date(2020, 1, 3), date(2021, 1, 4)],
    ...     })
    >>> wide_df['timestamp'] = pd.to_datetime(wide_df['timestamp'])
    >>> for col in ["A/code", "B/code", "A/value", "B/value"]:
    ...     wide_df[col] = pd.arrays.SparseArray(wide_df[col])
    >>> feature_columns = ["A/code/count", "B/code/count", "A/value/sum", "B/value/sum"]
    >>> aggregations = ["code/count", "value/sum"]
    >>> window_sizes = ["full", "1d"]
    >>> generate_summary(feature_columns, wide_df, window_sizes, aggregations)[
    ...     ["1d/A/code/count", "full/B/code/count", "full/B/value/sum"]]
       1d/A/code/count  full/B/code/count  full/B/value/sum
    0              NaN                1.0                 0
    1              NaN                1.0                 0
    2              NaN                1.0                 0
    0              NaN                1.0                 0
    0              NaN                NaN                 0
    1              NaN                NaN                 0
    2              NaN                NaN                 0
    0              NaN                NaN                 0
    0                0                NaN                 0
    1              1.0                NaN                 0
    2              2.0                NaN                 0
    0                0                NaN                 0
    0              NaN                NaN                 0
    1              NaN                NaN                 0
    2              NaN                NaN                 0
    0              NaN                NaN                 0
    """
Contributor:

Ensure all specified feature columns exist in the final output, adding missing ones with default values.

+ # Ensure all specified feature columns are present in the final output
+ missing_cols = [col for col in feature_columns if col not in df.columns]
+ if missing_cols:
+     for col in missing_cols:
+         df[col] = pd.arrays.SparseArray([0]*len(df), fill_value=0)

Committable suggestion was skipped due to low confidence.

Comment on lines +28 to +41
def sparse_aggregate(sparse_matrix, agg):
    if agg == "sum":
        merged_matrix = sparse_matrix.sum(axis=0, dtype=sparse_matrix.dtype)
    elif agg == "min":
        merged_matrix = sparse_matrix.min(axis=0)
    elif agg == "max":
        merged_matrix = sparse_matrix.max(axis=0)
    elif agg == "sum_sqd":
        merged_matrix = sparse_matrix.power(2).sum(axis=0, dtype=sparse_matrix.dtype)
    elif agg == "count":
        merged_matrix = sparse_matrix.getnnz(axis=0)
    else:
        raise ValueError(f"Aggregation method '{agg}' not implemented.")
    return merged_matrix
Contributor:

Enhance error handling by providing a list of valid aggregation methods in the error message.

- raise ValueError(f"Aggregation method '{agg}' not implemented.")
+ raise ValueError(f"Aggregation method '{agg}' not implemented. Valid methods are: sum, min, max, sum_sqd, count.")

Comment on lines +15 to +25
def time_aggd_col_alias_fntr(window_size: str, agg: str) -> Callable[[str], str]:
    if agg is None:
        raise ValueError("Aggregation type 'agg' must be provided")

    def f(c: str) -> str:
        if c in ["patient_id", "timestamp"]:
            return c
        else:
            return "/".join([window_size] + c.split("/") + [agg])

    return f
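For reference, a quick illustration of what this alias factory produces, derived directly from the code above:

alias = time_aggd_col_alias_fntr("1d", "count")
alias("A/code")      # -> "1d/A/code/count"
alias("patient_id")  # -> "patient_id" (index columns pass through unchanged)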
Contributor:

Ensure proper error handling by replacing assertions with exceptions.

- if agg is None:
-     raise ValueError("Aggregation type 'agg' must be provided")
+ if not isinstance(agg, str) or not agg:
+     raise ValueError("Aggregation type 'agg' must be provided and must be a non-empty string")

Comment on lines +1 to +12
from collections.abc import Callable

import pandas as pd

pd.set_option("compute.use_numba", True)
import numpy as np
import polars as pl
from loguru import logger
from scipy.sparse import coo_array, csr_array, sparray, vstack

from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
Contributor:

Reorganize module-level imports to the top of the file.

+ import numpy as np
+ import pandas as pd
+ import polars as pl
+ from collections.abc import Callable
+ from loguru import logger
+ from scipy.sparse import coo_array, csr_array, sparray, vstack
+ from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
+ from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
- from collections.abc import Callable
- import pandas as pd
- pd.set_option("compute.use_numba", True)
- import numpy as np
- import polars as pl
- from loguru import logger
- from scipy.sparse import coo_array, csr_array, sparray, vstack
- from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
- from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm

Committable suggestion was skipped due to low confidence.

Tools

Ruff: Module level import not at top of file (lines 6-9, 11-12).

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cb21821 and 3a412a0.

Files selected for processing (1)
  • scripts/summarize_over_windows.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • scripts/summarize_over_windows.py

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 6

Outside diff range and nitpick comments (2)
hf_cohort/aces_task_extraction.py (1)

36-76: Consider enhancing logging for better traceability of the data processing workflow.

Adding more detailed logging statements can help in debugging and understanding the flow of data, especially in a complex data processing pipeline like this one.

scripts/xgboost_sweep.py (1)

276-284: Clarify the choice between in-memory and iterator-based data structures in the _build method.

Consider adding comments to explain why one might choose to build data structures in memory versus using iterators, as this could help future maintainers understand the design decisions.
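As a sketch of the two options in XGBoost terms, the example below contrasts in-memory construction with iterator-based streaming. It assumes XGBoost's DataIter/QuantileDMatrix API (available in recent XGBoost releases) and is not the PR's exact Iterator class.

import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)

# Option 1: in-memory. Simple, but holds the full dataset at once.
dtrain = xgb.DMatrix(X, label=y)

# Option 2: iterator-based. Streams shards, keeping peak memory low.
class ShardIter(xgb.DataIter):
    def __init__(self, shards):
        self._shards, self._i = shards, 0
        super().__init__()

    def next(self, input_data):
        if self._i == len(self._shards):
            return 0  # no more shards
        Xs, ys = self._shards[self._i]
        input_data(data=Xs, label=ys)
        self._i += 1
        return 1

    def reset(self):
        self._i = 0

it = ShardIter([(X[:50], y[:50]), (X[50:], y[50:])])
dtrain_streamed = xgb.QuantileDMatrix(it)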

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3a412a0 and a4f1843.

Files selected for processing (8)
  • configs/xgboost_sweep.yaml (1 hunks)
  • hf_cohort/aces_task_extraction.py (1 hunks)
  • scripts/tabularize_static.py (1 hunks)
  • scripts/xgboost_sweep.py (1 hunks)
  • src/MEDS_tabular_automl/file_name.py (1 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • configs/xgboost_sweep.yaml
  • scripts/tabularize_static.py
  • src/MEDS_tabular_automl/file_name.py
  • src/MEDS_tabular_automl/generate_static_features.py
  • src/MEDS_tabular_automl/utils.py
Additional context used

Ruff, tests/test_tabularize.py: Module level import not at top of file (lines 5-8, 10-12, 14-24).

Additional comments not posted (4)
hf_cohort/aces_task_extraction.py (1)

13-20: LGTM! The function get_events_df is well-implemented with clear logic for filtering and processing data frames.

scripts/xgboost_sweep.py (3)

21-48: Refactor Iterator class to improve readability and maintainability.

The Iterator class is quite large and handles multiple responsibilities. Consider breaking it down into smaller classes or functions, each handling a specific part of the data processing, such as loading data, filtering, and iteration management.


270-271: Add error handling and validation checks in train and evaluate methods.

+ # TODO: add in eval, early stopping, etc.
+ # TODO: check for Nan and inf in labels!

These TODOs are crucial for robust model training. Implementing early stopping, evaluation during training, and checks for NaN or infinite values in labels will help prevent overfitting and ensure the model's reliability.
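A minimal sketch of the label check this TODO calls for, assumed to run just before DMatrix construction (numpy only; not the PR's code):

import numpy as np

def check_labels(labels: np.ndarray) -> None:
    """Fail fast if labels contain NaN or infinite values before training."""
    if not np.all(np.isfinite(labels)):
        bad = np.flatnonzero(~np.isfinite(labels))
        raise ValueError(f"Labels contain NaN/inf at indices {bad[:10]}")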


319-346: Ensure consistent logging and error handling in the xgboost function.

Consider adding more detailed logging at each step of the model's configuration, training, and evaluation to help trace the execution flow and identify potential issues. Also, include error handling to manage exceptions that may occur during the model's operation.

Comment on lines +23 to +33
def get_unique_time_events_df(events_df: pl.DataFrame):
    """Updates Events DataFrame to have unique timestamps and sorted by patient_id and timestamp."""
    assert events_df.select(pl.col("timestamp")).null_count().collect().item() == 0
    # Check events_df is sorted - so it aligns with the ts_matrix we generate later in the pipeline
    events_df = (
        events_df.drop_nulls("timestamp")
        .select(pl.col(["patient_id", "timestamp"]))
        .unique(maintain_order=True)
    )
    assert events_df.sort(by=["patient_id", "timestamp"]).collect().equals(events_df.collect())
    return events_df
Contributor:

Consider replacing assertions with explicit error handling for production environments.

Assertions are generally disabled in production Python environments (using -O flag), which could lead to unhandled errors. Consider using conditional checks and raising exceptions instead.
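A sketch of the suggested change, replacing the first assertion above with an explicit exception (reusing the names from the snippet above; this mirrors the reviewer's suggestion, not code from the PR):

null_ts = events_df.select(pl.col("timestamp")).null_count().collect().item()
if null_ts != 0:
    raise ValueError(f"events_df contains {null_ts} null timestamps; expected none")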

Comment on lines 194 to 322

# Store MEDS outputs
for split, data in MEDS_OUTPUTS.items():
    file_path = MEDS_cohort_dir / "final_cohort" / f"{split}.parquet"
    file_path.parent.mkdir(exist_ok=True)
    df = pl.read_csv(StringIO(data))
    df.with_columns(pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S.%f")).write_parquet(
        file_path
    )

# Check the files are not empty
meds_files = f_name_resolver.list_meds_files()
assert len(meds_files) == 4, "MEDS Data Files Should be 4!"
for f in meds_files:
    assert pl.read_parquet(f).shape[0] > 0, "MEDS Data Tabular Dataframe Should not be Empty!"

split_json = json.load(StringIO(SPLITS_JSON))
splits_fp = MEDS_cohort_dir / "splits.json"
json.dump(split_json, splits_fp.open("w"))
logger.info("caching flat representation of MEDS data")
store_columns(cfg)
assert (tabularized_data_dir / "config.yaml").is_file()
assert (tabularized_data_dir / "feature_columns.json").is_file()
assert (tabularized_data_dir / "feature_freqs.json").is_file()

feature_columns = json.load(open(f_name_resolver.get_feature_columns_fp()))
assert get_feature_names("code/count", feature_columns) == sorted(CODE_COLS)
assert get_feature_names("static/present", feature_columns) == sorted(STATIC_PRESENT_COLS)
assert get_feature_names("static/first", feature_columns) == sorted(STATIC_FIRST_COLS)
for value_agg in VALUE_AGGREGATIONS:
    assert get_feature_names(value_agg, feature_columns) == sorted(VALUE_COLS)

# Check Static File Generation
tabularize_static_data(cfg)
actual_files = [str(Path(*f.parts[-5:])) for f in f_name_resolver.list_static_files()]
assert set(actual_files) == set(EXPECTED_STATIC_FILES)
# Check the files are not empty
for f in f_name_resolver.list_static_files():
    static_matrix = load_matrix(f)
    assert static_matrix.shape[0] > 0, "Static Data Tabular Dataframe Should not be Empty!"
    expected_num_cols = len(get_feature_names(f"static/{f.stem}", feature_columns))
    logger.info((static_matrix.shape[1], expected_num_cols))
    logger.info(f_name_resolver.list_static_files())
    assert static_matrix.shape[1] == expected_num_cols, (
        f"Static Data Tabular Dataframe Should have {expected_num_cols}"
        f"Columns but has {static_matrix.shape[1]}!"
    )
static_first_fp = f_name_resolver.get_flat_static_rep("tuning", "0", "static/first")
static_present_fp = f_name_resolver.get_flat_static_rep("tuning", "0", "static/present")
assert (
    load_matrix(static_first_fp).shape[0] == load_matrix(static_present_fp).shape[0]
), "static data first and present aggregations have different numbers of rows"

summarize_ts_data_over_windows(cfg)
# confirm summary files exist:
output_files = f_name_resolver.list_ts_files()
f_name_resolver.list_ts_files()
actual_files = [str(Path(*f.parts[-5:])) for f in output_files]

assert set(actual_files) == set(SUMMARIZE_EXPECTED_FILES)
for f in output_files:
    sparse_array = load_matrix(f)
    assert sparse_array.shape[0] > 0
    assert sparse_array.shape[1] > 0
ts_code_fp = f_name_resolver.get_flat_ts_rep("tuning", "0", "365d", "code/count")
ts_value_fp = f_name_resolver.get_flat_ts_rep("tuning", "0", "365d", "value/sum")
assert (
    load_matrix(ts_code_fp).shape[0] == load_matrix(ts_value_fp).shape[0]
), "time series code and value have different numbers of rows"
assert (
    load_matrix(static_first_fp).shape[0] == load_matrix(ts_value_fp).shape[0]
), "static data and time series have different numbers of rows"

# Create fake labels
for f in f_name_resolver.list_meds_files():
    df = pl.read_parquet(f)
    df = get_events_df(df, feature_columns)
    pseudo_labels = pl.Series(([0, 1] * df.shape[0])[: df.shape[0]])
    df = df.with_columns(pl.Series(name="label", values=pseudo_labels))
    df = df.select(pl.col(["patient_id", "timestamp", "label"]))
    df = df.unique(subset=["patient_id", "timestamp"])
    df = df.with_row_index("event_id")

    split = f.parent.stem
    shard_num = f.stem
    out_f = f_name_resolver.get_label(split, shard_num)
    out_f.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(out_f)
model_dir = Path(d) / "save_model"
xgboost_config_kwargs = {
    "model_dir": str(model_dir.resolve()),
    "hydra.mode": "MULTIRUN",
}
xgboost_config_kwargs = {**tabularize_config_kwargs, **xgboost_config_kwargs}
with initialize(version_base=None, config_path="../configs/"):  # path to config.yaml
    overrides = [f"{k}={v}" for k, v in xgboost_config_kwargs.items()]
    cfg = compose(config_name="xgboost_sweep", overrides=overrides)  # config.yaml
    xgboost(cfg)
output_files = list(model_dir.glob("*/*/*_model.json"))
assert len(output_files) == 1
Contributor:

Refactor test_tabularize into smaller, more focused tests.

This function covers multiple aspects of the data processing workflow, including setup, execution, and verification. Splitting this into smaller tests would improve maintainability and make it easier to identify issues with specific parts of the data processing pipeline.
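One hedged way to act on this suggestion: move the shared setup into a pytest fixture and give each pipeline stage its own focused test. The function names store_columns, tabularize_static_data, and summarize_ts_data_over_windows come from this PR; make_test_config is a hypothetical helper wrapping the Hydra compose step.

import pytest

@pytest.fixture
def tabularized_cohort(tmp_path):
    """Shared setup: write the synthetic MEDS cohort and run column identification."""
    cfg = make_test_config(tmp_path)  # hypothetical helper, not in the PR
    store_columns(cfg)
    return cfg

def test_static_tabularization(tabularized_cohort):
    tabularize_static_data(tabularized_cohort)
    # assertions on the static output files only ...

def test_window_summarization(tabularized_cohort):
    summarize_ts_data_over_windows(tabularized_cohort)
    # assertions on the time-series summary files only ...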

Comment on lines +1 to +24
import rootutils

root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

import json
import tempfile
from io import StringIO
from pathlib import Path

import polars as pl
from hydra import compose, initialize
from loguru import logger

from MEDS_tabular_automl.file_name import FileNameResolver
from MEDS_tabular_automl.utils import (
    VALUE_AGGREGATIONS,
    get_events_df,
    get_feature_names,
    load_matrix,
)
from scripts.identify_columns import store_columns
from scripts.summarize_over_windows import summarize_ts_data_over_windows
from scripts.tabularize_static import tabularize_static_data
from scripts.xgboost_sweep import xgboost
Reorganize import statements to the top of the file to follow Python's best practices.

+ import json
+ import tempfile
+ from io import StringIO
+ from pathlib import Path
+ import polars as pl
+ from hydra import compose, initialize
+ from loguru import logger
+ from MEDS_tabular_automl.file_name import FileNameResolver
+ from MEDS_tabular_automl.utils import VALUE_AGGREGATIONS, get_feature_names, load_matrix
+ from scripts.identify_columns import store_columns
+ from scripts.summarize_over_windows import summarize_ts_data_over_windows
+ from scripts.tabularize_static import tabularize_static_data
- import rootutils
Tools
Ruff

5-8, 10-12, 14-24: Module level import not at top of file (one finding per line in the import block)

Comment on lines +319 to +346
@hydra.main(version_base=None, config_path="../configs", config_name="xgboost_sweep")
def xgboost(cfg: DictConfig) -> float:
    """Optimize the model based on the provided configuration.

    Args:
    - cfg (DictConfig): Configuration dictionary.

    Returns:
    - float: Evaluation result.
    """
    logger.debug("Initializing XGBoost model")
    model = XGBoostModel(cfg)
    logger.debug("Training XGBoost model")
    time = datetime.now()
    model.train()
    logger.debug(f"Training took {datetime.now() - time}")
    # save model
    save_dir = (
        Path(cfg.model_dir)
        / "_".join(map(str, cfg.window_sizes))
        / "_".join([agg.replace("/", "") for agg in cfg.aggs])
    )
    save_dir.mkdir(parents=True, exist_ok=True)

    model.model.save_model(save_dir / f"{np.random.randint(100000, 999999)}_model.json")

    return model.evaluate()

Enhance the xgboost function with detailed logging and error handling.

Consider adding more detailed logging at each step of the model's configuration, training, and evaluation to help trace the execution flow and identify potential issues. Also, include error handling to manage exceptions that may occur during the model's operation.


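A minimal sketch of what that could look like, reusing the names from the snippet above (XGBoostModel, loguru's logger, Hydra's decorator); the try/except structure is illustrative, not part of the PR:

@hydra.main(version_base=None, config_path="../configs", config_name="xgboost_sweep")
def xgboost(cfg: DictConfig) -> float:
    logger.info(f"Window sizes: {cfg.window_sizes}; aggregations: {cfg.aggs}")
    try:
        model = XGBoostModel(cfg)
        logger.info("Model initialized; starting training")
        model.train()
        logger.info("Training finished; evaluating")
        return model.evaluate()
    except Exception:
        # Log the full traceback, then re-raise so the sweep still registers the failure.
        logger.exception("XGBoost run failed")
        raise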

Comment on lines +306 to +317
def evaluate(self) -> float:
    """Evaluate the model on the test set.

    Returns:
    - float: Evaluation metric (mae).
    """
    # TODO: Figure out exactly what we want to do here

    y_pred = self.model.predict(self.dtest)
    y_true = self.dtest.get_label()
    return mean_absolute_error(y_true, y_pred)

Improve robustness of the evaluate method with error handling and detailed logging.

Consider adding try-except blocks to handle potential exceptions during model evaluation. Additionally, include detailed logging to trace the computation of the evaluation metric.
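
For instance, a hedged sketch that keeps the existing MAE metric and loguru logger:

def evaluate(self) -> float:
    """Evaluate the model on the test set, logging and re-raising any failure."""
    try:
        y_pred = self.model.predict(self.dtest)
        y_true = self.dtest.get_label()
        mae = mean_absolute_error(y_true, y_pred)
        logger.info(f"Test MAE over {len(y_true)} rows: {mae:.4f}")
        return mae
    except Exception:
        logger.exception("Model evaluation failed")
        raise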

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Outside diff range and nitpick comments (1)
src/MEDS_tabular_automl/generate_summarized_reps.py (1)

83-109: Consider enhancing the documentation to clarify the operations performed during the rolling window aggregation.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a4f1843 and 820e194.

Files selected for processing (1)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/generate_summarized_reps.py

6-9, 11-12: Module level import not at top of file (one finding per import line)

Additional comments not posted (8)
src/MEDS_tabular_automl/generate_summarized_reps.py (8)

15-25: LGTM! The function correctly handles the case where agg is None and appropriately transforms column names.


28-41: LGTM! The function handles various aggregation methods and maintains type consistency where needed.


112-124: LGTM! The function correctly calculates indices for rolling windows and handles special cases appropriately.


127-160: LGTM! The function efficiently aggregates the matrix based on window indices and optionally displays progress.


163-227: LGTM! The function correctly applies aggregation to the dataframe and includes comprehensive documentation with examples.


230-278: LGTM! The function validates the aggregation type and processes the data frame to generate a summary.


281-349: LGTM! The function processes the data frame to apply specified aggregations and ensures all required feature columns are present in the final output.


351-375: LGTM! The main block correctly sets up the environment and processes the data using the defined functions.

Comment on lines +44 to +80
def sum_merge_timestamps(df, sparse_matrix, agg):
    """Groups by timestamp and combines rows that are on the same date.

    The combining is done by summing the rows in the sparse matrix that correspond to the same date.

    Args:
        df (DataFrame): The DataFrame with 'timestamp' and 'patient_id'.
        sparse_matrix (csr_matrix): The corresponding sparse matrix with data.
        agg (str): Aggregation method, currently only 'sum' is implemented.

    Returns:
        DataFrame, csr_matrix: Tuple containing the DataFrame with aggregated timestamps and the
            corresponding sparse matrix.
    """
    # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
    # df = df.sort_values(by='timestamp')

    # Group by timestamp and sum the data
    grouped = df.groupby("timestamp")
    indices = grouped.indices

    # Create a new sparse matrix with summed rows per unique timestamp
    patient_id = df["patient_id"].iloc[0]
    timestamps = []
    output_matrix = csr_array((0, sparse_matrix.shape[1]), dtype=sparse_matrix.dtype)

    # Loop through each group and sum
    for timestamp, rows in indices.items():
        # Combine the rows in the sparse matrix for the current group (respecting the aggregation being used)
        merged_matrix = sparse_aggregate(sparse_matrix[rows], agg)
        # Save the non-zero elements
        output_matrix = vstack([output_matrix, merged_matrix])
        timestamps.append(timestamp)

    # Create output DataFrame
    out_df = pd.DataFrame({"patient_id": [patient_id] * len(timestamps), "timestamp": timestamps})
    return out_df, output_matrix
Consider ensuring timestamps are sorted within the function to avoid potential issues if the assumption is not met externally.

- # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
+ df = df.sort_values(by='timestamp')
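
One caveat with sorting inside the function: the rows of sparse_matrix are positionally aligned with df, so sorting the frame alone would desynchronize the two. A minimal sketch that keeps them aligned (same variable names as above):

import numpy as np

# Apply one stable permutation to both structures so row i of
# sparse_matrix still corresponds to row i of df after sorting.
order = np.argsort(df["timestamp"].to_numpy(), kind="stable")
df = df.iloc[order].reset_index(drop=True)
sparse_matrix = sparse_matrix[order]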
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 820e194 and 4b0637a.

Files selected for processing (1)
  • hf_cohort/aces_task_extraction.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • hf_cohort/aces_task_extraction.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4b0637a and 23877ad.

Files selected for processing (1)
  • hf_cohort/hf_cohort_e2e.sh (1 hunks)
Additional context used
Shellcheck
hf_cohort/hf_cohort_e2e.sh

[warning] 4-4: N_RUNS appears unused. Verify use (or export if used externally).


[warning] 5-5: OUTPUT_BASE appears unused. Verify use (or export if used externally).


[warning] 29-29: POLARS_MAX_THREADS appears unused. Verify use (or export if used externally).


[info] 52-52: Double quote to prevent globbing and word splitting.

Additional comments not posted (1)
hf_cohort/hf_cohort_e2e.sh (1)

10-10: Ensure N_PARALLEL_WORKERS is properly validated.

It's important to validate the input for N_PARALLEL_WORKERS to ensure it's a positive integer. This prevents potential runtime errors or unexpected behaviors.

Verification successful

The validation for N_PARALLEL_WORKERS is correctly implemented and ensures it is a positive integer.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that N_PARALLEL_WORKERS is a positive integer.
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
   echo "Error: N_PARALLEL_WORKERS is not a positive integer."
   exit 1
fi

Length of output: 146

echo "build_dataset.sh failed with status $cmd_exit_status."
echo "Stderr from build_dataset.sh (see $LOG_DIR/cmd.stderr):"
tail $LOG_DIR/cmd.stderr
exit $cmd_exit_status
Correct the shell syntax for robustness.

Quote the variable passed to exit so the status value is passed through without globbing or word splitting.

- exit $cmd_exit_status
+ exit "$cmd_exit_status"
Tools
Shellcheck

[info] 52-52: Double quote to prevent globbing and word splitting.

Comment on lines +16 to +21
echo "Running identify_columns.py: Caching feature names and frequencies."
rm -rf $OUTPUT_DIR
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
MEDS_cohort_dir=$MEDS_DIR \
tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"
Consider adding error handling for the rm and python commands.

The script lacks error handling for critical operations like removing directories and running Python scripts. Adding error checks will improve the robustness of the script.

+ if ! rm -rf $OUTPUT_DIR; then
+    echo "Failed to remove $OUTPUT_DIR"
+    exit 1
+ fi
+ if ! POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
+    MEDS_cohort_dir=$MEDS_DIR \
+    tabularized_data_dir=$OUTPUT_DIR \
+    min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"; then
+    echo "Python script failed"
+    exit 1
+ fi

Comment on lines +11 to +12
WINDOW_SIZES="window_sizes=[1d,7d,30d,365d,full]"
AGGS="aggs=[static/present,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"
Tip

Codebase Verification

The variables WINDOW_SIZES and AGGS are defined in the script hf_cohort/hf_cohort_e2e.sh, but they appear to be empty when the script is executed. This suggests that the variables might not be properly exported or sourced before the format check.

  • Ensure that WINDOW_SIZES and AGGS are correctly set and exported before the format check in the script.
Analysis chain

Review the aggregation and window size parameters.

The parameters for window sizes and aggregations are critical for the script's operation. Ensure these are correctly formatted and match the expected patterns used in downstream processing.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the format of WINDOW_SIZES and AGGS.
if ! [[ "$WINDOW_SIZES" =~ ^window_sizes=\[.*\]$ ]] ; then
   echo "Error: WINDOW_SIZES format is incorrect."
   exit 1
fi
if ! [[ "$AGGS" =~ ^aggs=\[.*\]$ ]] ; then
   echo "Error: AGGS format is incorrect."
   exit 1
fi

Length of output: 134



Script:

#!/bin/bash
# Description: Print the values of WINDOW_SIZES and AGGS to verify their correctness.
echo "WINDOW_SIZES: $WINDOW_SIZES"
echo "AGGS: $AGGS"

Length of output: 61



Script:

#!/bin/bash
# Description: Search for the assignments of WINDOW_SIZES and AGGS in the codebase.
rg 'WINDOW_SIZES='
rg 'AGGS='

Length of output: 560

tabularized_data_dir=$OUTPUT_DIR \
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"

POLARS_MAX_THREADS=1
Tip

Codebase Verification

The variable POLARS_MAX_THREADS is set to 1 in hf_cohort/hf_cohort_e2e.sh but is not used afterward in the script. It is used with a value of 32 in other parts of the same script and in hf_cohort/hf_cohort_shard.sh.

  • hf_cohort/hf_cohort_e2e.sh: POLARS_MAX_THREADS=32 is used with Python scripts.
  • hf_cohort/hf_cohort_shard.sh: POLARS_MAX_THREADS=32 is used with Python scripts.

However, the specific instance of POLARS_MAX_THREADS=1 does not appear to be utilized.

Analysis chain

Verify the usage of POLARS_MAX_THREADS.

The variable POLARS_MAX_THREADS is set to 1 without subsequent usage, which might be an oversight or error. Confirm its intended use.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Check if POLARS_MAX_THREADS is used after being set.
grep -q 'POLARS_MAX_THREADS' hf_cohort/hf_cohort_e2e.sh
if [ $? -ne 0 ]; then
    echo "POLARS_MAX_THREADS set but not used."
fi

Length of output: 73



Script:

#!/bin/bash
# Description: Check if POLARS_MAX_THREADS is used anywhere in the repository after being set.
grep -r 'POLARS_MAX_THREADS' .

Length of output: 905

Tools
Shellcheck

[warning] 29-29: POLARS_MAX_THREADS appears unused. Verify use (or export if used externally).

@mmcdermott mmcdermott mentioned this pull request Jun 2, 2024
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 23877ad and 36f54a3.

Files selected for processing (1)
  • .github/workflows/tests.yaml (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/tests.yaml

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 36f54a3 and e7a85ba.

Files selected for processing (1)
  • .github/workflows/tests.yaml (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/tests.yaml

…perties, but don't know what the desired behavior is for the error
Some fixes for pre-commit, testing, etc. NOT READY FOR MERGING YET
@Oufattole Oufattole merged commit f7605b5 into main Jun 5, 2024
0 of 2 checks passed
@mmcdermott mmcdermott deleted the esgpt_caching branch June 12, 2024 14:08
@coderabbitai coderabbitai bot mentioned this pull request Nov 6, 2024
Merged