Converting Esgpt caching to work for MEDS datasets #1
Conversation
…ts is correct when producing static representations
Walkthrough
The recent updates encompass a wide range of enhancements and additions to the MEDS tabular automation system. Key changes include the introduction of new configuration settings, the addition of several Python packages, updates to shell scripts for data processing, and new functions for summarizing and generating features from medical datasets. The updates also include improvements to the testing setup and the README documentation, ensuring a comprehensive and functional pipeline for handling and analyzing medical data.
Sequence Diagram(s) (Beta)

sequenceDiagram
    participant User
    participant Script
    participant Config
    participant DataProcessor
    participant Summarizer
    participant Storage
    User->>Script: Run hf_cohort_e2e.sh
    Script->>Config: Load configurations
    Script->>DataProcessor: Process medical data
    DataProcessor->>Summarizer: Summarize data over windows
    Summarizer->>Storage: Save summarized data
    Storage-->>User: Data processing complete
Actionable comments posted: 3
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- configs/tabularize.yaml (1 hunks)
- src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
- src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
- src/MEDS_tabular_automl/tabularize.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
- tests/test_tabularize.py (1 hunks)
Additional Context Used
Ruff (8)
tests/test_tabularize.py (8)
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
Additional comments not posted (14)
configs/tabularize.yaml (2)
Lines 23-24: Added new configuration settings for dynamic and numerical value thresholds. These settings are crucial for defining the behavior of dynamic and numerical value processing in the dataset. Ensure that these values are well-documented and tested to confirm they meet the intended thresholds.
Line 31: Enabled the `do_update` setting. This change allows the system to update existing data rather than overwriting it, which can be beneficial for incremental updates. Ensure that this behavior is consistent with the system's requirements and that it doesn't lead to data integrity issues.
src/MEDS_tabular_automl/generate_static_features.py (2)
Lines 20-84: Added function `_summarize_static_measurements` to aggregate static measurements. This function efficiently aggregates static measurements and pivots them into a tabular format. It's well-documented and uses list comprehensions and filtering effectively. Ensure that the feature columns used are correctly prefixed and that the pivot operation is optimized as suggested in the TODO comment.
Lines 87-112: Added function `get_flat_static_rep` to produce a tabular representation of static data features. This function wraps the `_summarize_static_measurements` function and normalizes the data, which is crucial for ensuring data consistency before further processing. The function is well-structured and documented. Ensure that the normalization process handles all edge cases appropriately.
tests/test_tabularize.py (1)
Lines 101-138: Added a comprehensive test function `test_tabularize` for the tabularization process. This test function is well-structured and covers the creation of directories, storing of data, and the tabularization process. It uses temporary directories to avoid side effects, which is a good practice. Ensure that all edge cases and possible errors during the tabularization process are handled and tested.
src/MEDS_tabular_automl/generate_ts_features.py (3)
Lines 19-131: Added function `_summarize_dynamic_measurements` to generate dynamic measurements. This function handles the inclusion of subjects dynamically and aggregates measurements based on configuration settings. It's well-documented and uses advanced Polars operations effectively. Ensure that the configuration settings (`measurement_configs`) are robust and cover all necessary scenarios.
Lines 134-214: Added function `_summarize_over_window` to apply aggregations over a specified window. This function provides flexibility in handling different window sizes and applies various aggregations based on the window size. It's a complex function that handles many edge cases, and it's well-documented. Ensure that the performance of this function is monitored, especially with large datasets.
Lines 217-229: Added function `get_flat_ts_rep` to produce a raw representation for dynamic data. This function integrates the dynamic measurements summarization and normalization into a single workflow. It's concise and leverages lazy evaluation effectively. Ensure that the sort and collect operations are necessary and do not introduce performance bottlenecks.
src/MEDS_tabular_automl/utils.py (6)
Line 27: Defined `DF_T` as a new data type for internal dataframes. Defining a specific type for dataframes (`DF_T`) improves code readability and maintainability by making the expected data type clear throughout the codebase.
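A minimal sketch of what such an alias might look like, assuming the internal dataframe type is a Polars `LazyFrame` (the concrete type in the repository may differ):

```python
# Hypothetical sketch: a shared alias for the library's internal dataframe type.
# Assumption: the pipeline operates on Polars LazyFrames; substitute the actual type used.
import polars as pl

DF_T = pl.LazyFrame

def passthrough(df: DF_T) -> DF_T:
    """Annotating helpers with DF_T makes the expected dataframe type explicit."""
    return df
```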
Lines 38-51: Added function `write_df` to write dataframes to disk. This function provides a robust way to write dataframes to disk with an option to overwrite existing files. It handles file paths and directories effectively and uses the Polars library for writing data. Ensure that error handling is robust, especially for file writing operations.
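As a rough illustration only (not the repository's actual implementation), a `write_df`-style helper with an overwrite guard might look like the following; the signature and parquet output format are assumptions:

```python
# Hypothetical sketch of a dataframe-writing helper with an overwrite flag.
from pathlib import Path

import polars as pl

def write_df(df: pl.DataFrame, fp: Path, do_overwrite: bool = False) -> None:
    """Write a dataframe to disk, refusing to clobber existing files unless asked."""
    fp = Path(fp)
    if fp.exists() and not do_overwrite:
        raise FileExistsError(f"{fp} exists and do_overwrite is False")
    fp.parent.mkdir(parents=True, exist_ok=True)  # ensure the target directory exists
    df.write_parquet(fp)
```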
Lines 53-87: Added function `get_smallest_valid_uint_type` to determine the smallest valid unsigned integer type. This function is crucial for optimizing memory usage by selecting the smallest possible data type that can still represent the data adequately. It's well-documented and includes comprehensive examples. Ensure that the function is used consistently throughout the codebase to maintain data type optimizations.
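A sketch of the general idea, assuming Polars unsigned integer dtypes are the target (the repository's version may differ in details):

```python
# Hypothetical sketch: pick the narrowest unsigned integer dtype that can hold a given value.
import polars as pl

def get_smallest_valid_uint_type(num: int) -> pl.DataType:
    if num >= 2**64:
        raise ValueError("Value is too large to be represented by an unsigned integer type!")
    if num >= 2**32:
        return pl.UInt64
    elif num >= 2**16:
        return pl.UInt32
    elif num >= 2**8:
        return pl.UInt16
    else:
        return pl.UInt8
```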
Lines 108-149: Enhanced `_normalize_flat_rep_df_cols` to handle data normalization and type setting. This function ensures that all necessary columns are added to a dataframe and that each column is set to the correct data type. It also handles zero counts, which can be important for certain types of data analysis. Ensure that this function is tested thoroughly to handle all edge cases.
Lines 152-213: Added `evaluate_code_properties` to categorize codes based on their properties. This function categorizes codes dynamically based on configuration settings and is a key component of the feature generation process. It's well-documented and handles various edge cases. Ensure that the configuration settings are accurate and that the function's output is validated against expected results.
Lines 216-269: Added `get_code_column` and `get_flat_rep_feature_cols` to generate feature column names. These functions are essential for generating feature columns based on the code properties. They are well-documented and use the `CodeType` enum effectively to ensure that the feature columns are generated correctly. Ensure that these functions are integrated properly with the rest of the feature generation pipeline.
- Create individual scripts for: 1. Retrieving column names, 2. Generating static representations, 3. Generating dynamic representations, 4. Summarizing data over windows
- Add doctests for retrieving column names and generating static representations
- Confirm functionality of the above tests
Actionable comments posted: 3
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (10)
- .pre-commit-config.yaml (1 hunks)
- scripts/identify_columns.py (1 hunks)
- scripts/summarize_over_windows.py (1 hunks)
- scripts/tabularize_static.py (1 hunks)
- scripts/tabularize_ts.py (1 hunks)
- src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
- src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
- src/MEDS_tabular_automl/tabularize.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
- tests/test_tabularize.py (1 hunks)
Files skipped from review due to trivial changes (1)
- .pre-commit-config.yaml
Files skipped from review as they are similar to previous changes (3)
- src/MEDS_tabular_automl/generate_static_features.py
- src/MEDS_tabular_automl/generate_ts_features.py
- src/MEDS_tabular_automl/tabularize.py
Additional Context Used
Ruff (11)
tests/test_tabularize.py (11)
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
15-15: Module level import not at top of file
16-16: Module level import not at top of file
17-17: Module level import not at top of file
Additional comments not posted (3)
scripts/tabularize_ts.py (1)
Lines 8-51: Well-documented function setup for `tabularize_ts_data`. The function is well-documented with clear descriptions of its purpose and parameters. The use of `hydra.main` for configuration and `setup_environment` for environment setup is appropriate.
scripts/tabularize_static.py (1)
Lines 19-125: Well-implemented functions for configuration storage and data processing. Both `store_config_yaml` and `tabularize_static_data` are well-documented and handle file operations and data processing effectively. The error handling and configuration management are appropriately implemented.
src/MEDS_tabular_automl/utils.py (1)
Lines 21-240: Comprehensive utility functions for data processing and configuration management. The utility functions in `src/MEDS_tabular_automl/utils.py` are well-documented and effectively handle data processing, feature column parsing, and configuration management. The error handling and data type management are appropriately implemented.
Actionable comments posted: 0
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- .github/workflows/tests.yaml (1 hunks)
- src/MEDS_tabular_automl/tabularize.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
Files skipped from review due to trivial changes (1)
- .github/workflows/tests.yaml
Files skipped from review as they are similar to previous changes (2)
- src/MEDS_tabular_automl/tabularize.py
- src/MEDS_tabular_automl/utils.py
Actionable comments posted: 1
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- scripts/identify_columns.py (1 hunks)
- scripts/tabularize_ts.py (1 hunks)
- src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
- tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
- scripts/identify_columns.py
- scripts/tabularize_ts.py
- src/MEDS_tabular_automl/utils.py
Additional Context Used
Ruff (11)
tests/test_tabularize.py (11)
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
15-15: Module level import not at top of file
16-16: Module level import not at top of file
17-17: Module level import not at top of file
Additional comments not posted (3)
src/MEDS_tabular_automl/generate_ts_features.py (2)
Line range hint 87-162: The implementation of `get_flat_ts_rep` correctly utilizes dynamic features for time series representation. Good use of filtering and summarization.
Lines 23-84: Renaming to `_summarize_dynamic_measurements` indicates internal use. Ensure all references in the codebase are updated accordingly.
Verification successful
The occurrences of the old function name `summarize_dynamic_measurements` have been successfully updated to `_summarize_dynamic_measurements` in the `src/MEDS_tabular_automl/generate_ts_features.py` file.
- `def summarize_dynamic_measurements` -> `def _summarize_dynamic_measurements`
- Example in the docstring updated.
- Return statement updated.
Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all function calls to `_summarize_dynamic_measurements` match the new name.
# Test: Search for the old function name usage. Expect: No occurrences of the old name.
rg --type python $'summarize_dynamic_measurements'

Length of output: 80
Script:
#!/bin/bash
# Description: Verify all function calls to `_summarize_dynamic_measurements` match the new name.
# Test: Search for the old function name usage. Expect: No occurrences of the old name.
rg --glob '*.py' 'summarize_dynamic_measurements'

Length of output: 371
Script:
#!/bin/bash
# Description: Update occurrences of the old function name to the new function name.

# Update the function definition
sed -i 's/def summarize_dynamic_measurements/def _summarize_dynamic_measurements/' src/MEDS_tabular_automl/generate_ts_features.py

# Update the example in the docstring
sed -i 's/summarize_dynamic_measurements(ts_columns, df)/_summarize_dynamic_measurements(ts_columns, df)/' src/MEDS_tabular_automl/generate_ts_features.py

# Update the return statement
sed -i 's/return summarize_dynamic_measurements/return _summarize_dynamic_measurements/' src/MEDS_tabular_automl/generate_ts_features.py

Length of output: 428
tests/test_tabularize.py (1)
Lines 104-162: The test logic in `test_tabularize` is well-structured and covers the necessary functionality effectively.
…dexed file types:
- Code data: Files with a patient_id and timestamp index, containing columns for each code with binary presence indicators (1 and 0).
- Value data: Files similar in structure but containing the numerical values observed for each code.
Actionable comments posted: 2
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- scripts/identify_columns.py (1 hunks)
- scripts/tabularize_ts.py (1 hunks)
- src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
- tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
- scripts/identify_columns.py
- scripts/tabularize_ts.py
- src/MEDS_tabular_automl/utils.py
Additional Context Used
Ruff (11)
tests/test_tabularize.py (11)
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
15-15: Module level import not at top of file
16-16: Module level import not at top of file
17-17: Module level import not at top of file
Additional comments not posted (2)
src/MEDS_tabular_automl/generate_ts_features.py (2)
Lines 18-79: The function `summarize_dynamic_measurements` is well-documented and the example provided is clear and illustrative.
Line range hint 82-162: The function `get_flat_ts_rep` is well-documented and effectively utilizes the `summarize_dynamic_measurements` function to process dynamic features.
…n handling
- Introduce VALID_AGGREGATIONS to define permissible aggregations.
- Implement to generate dynamic column aliases based on window size and aggregation.
- Extend for dynamic expression creation based on aggregation type and window size, handling both cumulative and windowed aggregations.
- Enhance to apply specified aggregations over defined window sizes, ensuring comprehensive data summarization.
- Update to handle multiple dataframes, aggregate data using specified window sizes and aggregations, and ensure inclusion of all specified feature columns, adding missing ones with default values.
- Add extensive doctests to ensure accuracy of the summarization functions, demonstrating usage with both code and value data types.
Actionable comments posted: 4
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- configs/tabularize.yaml (1 hunks)
- scripts/summarize_over_windows.py (1 hunks)
- src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
- src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
- src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
- configs/tabularize.yaml
- scripts/summarize_over_windows.py
- src/MEDS_tabular_automl/generate_static_features.py
- src/MEDS_tabular_automl/generate_ts_features.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 0
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/MEDS_tabular_automl/generate_summarized_reps.py
Can you write a brief summary of how these are expected to be used / run in the README?
Also, on all scripts, you need to start with `#!/usr/bin/env python` and have the `if __name__ == "__main__": main()` block at the end.
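As a sketch of the requested convention (the config path and `main` body below are placeholders, not the repository's actual scripts):

```python
#!/usr/bin/env python
"""Placeholder script illustrating the requested shebang and entry-point convention."""
import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="../configs", config_name="tabularize")
def main(cfg: DictConfig) -> None:
    # The script's actual processing logic would go here.
    ...


if __name__ == "__main__":
    main()
```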
configs/tabularize.yaml
Outdated
@@ -8,27 +8,23 @@ window_sizes: ???
codes: null
aggs:
- "code/count"
- "code/time_since_last"
- "code/time_since_first"
- "code/present"
Do we need these in addition to "*/count"? The two should actually be redundant, no?
    or simply as present, then performs a pivot to reshape the data for each patient, providing
    a tabular format where each row represents a patient and each column represents a static feature.
    """
    static_present = [c for c in feature_columns if c.startswith("STATIC_") and c.endswith("present")]
anything with a null timestamp should be assumed to be a static code. They won't all start with "STATIC_" (unless you've done some pre-processing here I'm missing.
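To make this concrete, a sketch of selecting static rows by a null timestamp rather than a "STATIC_" prefix (the helper name is hypothetical; column names follow the MEDS schema):

```python
# Hypothetical sketch: treat any row without a timestamp as a static measurement,
# instead of filtering on a "STATIC_" code prefix.
import polars as pl

def split_static_and_dynamic(df: pl.LazyFrame) -> tuple[pl.LazyFrame, pl.LazyFrame]:
    static_df = df.filter(pl.col("timestamp").is_null())
    dynamic_df = df.filter(pl.col("timestamp").is_not_null())
    return static_df, dynamic_df
```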
…ressing comments in other files
…toML into esgpt_caching
Actionable comments posted: 1
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (7)
- README.md (3 hunks)
- configs/tabularize.yaml (1 hunks)
- scripts/identify_columns.py (1 hunks)
- scripts/summarize_over_windows.py (1 hunks)
- src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
- src/MEDS_tabular_automl/generate_ts_features.py (1 hunks)
- tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
- configs/tabularize.yaml
- scripts/identify_columns.py
- scripts/summarize_over_windows.py
- src/MEDS_tabular_automl/generate_summarized_reps.py
- src/MEDS_tabular_automl/generate_ts_features.py
Additional Context Used
LanguageTool (2)
README.md (2)
Near line 54: It appears that a hyphen is missing in the plural noun “to-dos”?
Context: ...bularize.yaml`. ## AutoML Pipelines # TODOs 1. Leverage the "event bound aggregati...
Rule ID: TO_DO_HYPHEN
Near line 70: Possible verb agreement error. Did you mean “formats”? (Some collective nouns can be treated as both singular and plural, so ‘format’ is not always incorrect.)
Context: ...n - MEDS_cohort_dir: directory of MEDS format dataset that is ingested. - tabularized...
Rule ID: COLLECTIVE_NOUN_VERB_AGREEMENT_VBP
Ruff (11)
tests/test_tabularize.py (11)
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
15-15: Module level import not at top of file
16-16: Module level import not at top of file
17-17: Module level import not at top of file
Markdownlint (5)
README.md (5)
6: Expected: h2; Actual: h4
Heading levels should only increment by one level at a time
27: null
Multiple top-level headings in the same document
31: null
Multiple top-level headings in the same document
54: null
Multiple top-level headings in the same document
68: null
Multiple top-level headings in the same document
Additional comments not posted (6)
README.md (5)
Line range hint 1-25: The project description is clear and effectively sets the stage for understanding the unique challenges and solutions provided by the repository.
Line 28: Installation instructions are concise and clear.
Lines 32-43: The usage section effectively communicates the core functionalities of the repository.
Lines 55-67: The TODOs are well-defined and provide clear directions for future enhancements.
Lines 68-90: The configuration arguments are thoroughly documented, enhancing the usability of the repository.
tests/test_tabularize.py (1)
Lines 104-179: The test function is well-structured and covers a comprehensive range of scenarios for testing the tabularization process.
Actionable comments posted: 5
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/generate_summarized_reps.py
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
9-9: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
Additional comments not posted (3)
src/MEDS_tabular_automl/generate_summarized_reps.py (3)
Lines 112-126: Add input validation for `window_size`.
Lines 83-109: Add input validation for `window_size`.
+ if not isinstance(window_size, str):
+     raise TypeError("window_size must be a string")
Likely invalid or redundant comment.
Lines 44-80: Ensure timestamps are sorted before grouping.
- # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
+ df = df.sort_values(by='timestamp')
Likely invalid or redundant comment.
def compute_agg(index_df, matrix: sparray, window_size: str, agg: str, num_features: int, use_tqdm=False):
    """Applies aggregation to dataframe.

    Dataframe is expected to only have the relevant columns for aggregating.
    It should have the patient_id and timestamp columns, and then only code columns
    if agg is a code aggregation or only value columns if it is a value aggregation.

    Example:
    >>> from MEDS_tabular_automl.generate_ts_features import get_flat_ts_rep
    >>> feature_columns = ['A/value/sum', 'A/code/count', 'B/value/sum', 'B/code/count',
    ...     "C/value/sum", "C/code/count", "A/static/present"]
    >>> data = {'patient_id': [1, 1, 1, 2, 2, 2],
    ...     'code': ['A', 'A', 'B', 'B', 'C', 'C'],
    ...     'timestamp': ['2021-01-01', '2021-01-01', '2020-01-01', '2021-01-04', None, None],
    ...     'numerical_value': [1, 2, 2, 2, 3, 4]}
    >>> df = pl.DataFrame(data).lazy()
    >>> df = get_flat_ts_rep(feature_columns, df)
    >>> df
    patient_id timestamp A/value B/value C/value A/code B/code C/code
    0 1 2021-01-01 1 0 0 1 0 0
    1 1 2021-01-01 2 0 0 1 0 0
    2 1 2020-01-01 0 2 0 0 1 0
    3 2 2021-01-04 0 2 0 0 1 0
    >>> df['timestamp'] = pd.to_datetime(df['timestamp'])
    >>> df.dtypes
    patient_id int64
    timestamp datetime64[ns]
    A/value Sparse[int64, 0]
    B/value Sparse[int64, 0]
    C/value Sparse[int64, 0]
    A/code Sparse[int64, 0]
    B/code Sparse[int64, 0]
    C/code Sparse[int64, 0]
    dtype: object
    >>> output = compute_agg(df[['patient_id', 'timestamp', 'A/code', 'B/code', 'C/code']],
    ...     "1d", "code/count")
    >>> output
    1d/A/code/count 1d/B/code/count 1d/C/code/count timestamp patient_id
    0 1 0 0 2021-01-01 1
    1 2 0 0 2021-01-01 1
    2 0 1 0 2020-01-01 1
    0 0 1 0 2021-01-04 2
    >>> output.dtypes
    1d/A/code/count Sparse[int64, 0]
    1d/B/code/count Sparse[int64, 0]
    1d/C/code/count Sparse[int64, 0]
    timestamp datetime64[ns]
    patient_id int64
    dtype: object
    """
    group_df = (
        index_df.with_row_index("index")
        .group_by(["patient_id", "timestamp"], maintain_order=True)
        .agg([pl.col("index").min().alias("min_index"), pl.col("index").max().alias("max_index")])
        .collect()
    )
    index_df = group_df.lazy().select(pl.col("patient_id", "timestamp"))
    windows = group_df.select(pl.col("min_index", "max_index"))
    logger.info("Step 1.5: Running sparse aggregation.")
    matrix = aggregate_matrix(windows, matrix, agg, num_features, use_tqdm)
    logger.info("Step 2: computing rolling windows and aggregating.")
    windows = get_rolling_window_indicies(index_df, window_size)
    logger.info("Starting final sparse aggregations.")
    matrix = aggregate_matrix(windows, matrix, agg, num_features, use_tqdm)
    return matrix
Simplify the function or add comments to improve readability.
The function `compute_agg` performs multiple complex operations. Consider breaking it down into smaller functions or adding detailed comments to explain each step.
def generate_summary(
    feature_columns: list[str], index_df: pl.LazyFrame, matrix: sparray, window_size, agg: str, use_tqdm=False
) -> pl.LazyFrame:
    """Generate a summary of the data frame for given window sizes and aggregations.

    This function processes a dataframe to apply specified aggregations over defined window sizes.
    It then joins the resulting frames on 'patient_id' and 'timestamp', and ensures all specified
    feature columns exist in the final output, adding missing ones with default values.

    Args:
        feature_columns (list[str]): List of all feature columns that must exist in the final output.
        df (list[pl.LazyFrame]): The input dataframes to process, expected to be length 2 list with code_df
            (pivoted shard with binary presence of codes) and value_df (pivoted shard with numerical values
            for each code).
        window_sizes (list[str]): List of window sizes to apply for summarization.
        aggregations (list[str]): List of aggregations to perform within each window size.

    Returns:
        pl.LazyFrame: A LazyFrame containing the summarized data with all required features present.

    Expect:
    >>> from datetime import date
    >>> wide_df = pd.DataFrame({"patient_id": [1, 1, 1, 2],
    ...     "A/code": [1, 1, 0, 0],
    ...     "B/code": [0, 0, 1, 1],
    ...     "A/value": [1, 2, 3, None],
    ...     "B/value": [None, None, None, 4.0],
    ...     "timestamp": [date(2021, 1, 1), date(2021, 1, 1), date(2020, 1, 3), date(2021, 1, 4)],
    ...     })
    >>> wide_df['timestamp'] = pd.to_datetime(wide_df['timestamp'])
    >>> for col in ["A/code", "B/code", "A/value", "B/value"]:
    ...     wide_df[col] = pd.arrays.SparseArray(wide_df[col])
    >>> feature_columns = ["A/code/count", "B/code/count", "A/value/sum", "B/value/sum"]
    >>> aggregations = ["code/count", "value/sum"]
    >>> window_sizes = ["full", "1d"]
    >>> generate_summary(feature_columns, wide_df, window_sizes, aggregations)[
    ...     ["1d/A/code/count", "full/B/code/count", "full/B/value/sum"]]
    1d/A/code/count full/B/code/count full/B/value/sum
    0 NaN 1.0 0
    1 NaN 1.0 0
    2 NaN 1.0 0
    0 NaN 1.0 0
    0 NaN NaN 0
    1 NaN NaN 0
    2 NaN NaN 0
    0 NaN NaN 0
    0 0 NaN 0
    1 1.0 NaN 0
    2 2.0 NaN 0
    0 0 NaN 0
    0 NaN NaN 0
    1 NaN NaN 0
    2 NaN NaN 0
    0 NaN NaN 0
    """
Ensure all specified feature columns exist in the final output, adding missing ones with default values.
+ # Ensure all specified feature columns are present in the final output
+ missing_cols = [col for col in feature_columns if col not in df.columns]
+ if missing_cols:
+ for col in missing_cols:
+ df[col] = pd.arrays.SparseArray([0]*len(df), fill_value=0)
Committable suggestion was skipped due to low confidence.
def sparse_aggregate(sparse_matrix, agg):
    if agg == "sum":
        merged_matrix = sparse_matrix.sum(axis=0, dtype=sparse_matrix.dtype)
    elif agg == "min":
        merged_matrix = sparse_matrix.min(axis=0)
    elif agg == "max":
        merged_matrix = sparse_matrix.max(axis=0)
    elif agg == "sum_sqd":
        merged_matrix = sparse_matrix.power(2).sum(axis=0, dtype=sparse_matrix.dtype)
    elif agg == "count":
        merged_matrix = sparse_matrix.getnnz(axis=0)
    else:
        raise ValueError(f"Aggregation method '{agg}' not implemented.")
    return merged_matrix
Enhance error handling by providing a list of valid aggregation methods in the error message.
- raise ValueError(f"Aggregation method '{agg}' not implemented.")
+ raise ValueError(f"Aggregation method '{agg}' not implemented. Valid methods are: sum, min, max, sum_sqd, count.")
def time_aggd_col_alias_fntr(window_size: str, agg: str) -> Callable[[str], str]:
    if agg is None:
        raise ValueError("Aggregation type 'agg' must be provided")

    def f(c: str) -> str:
        if c in ["patient_id", "timestamp"]:
            return c
        else:
            return "/".join([window_size] + c.split("/") + [agg])

    return f
Ensure proper error handling by replacing assertions with exceptions.
- if agg is None:
- raise ValueError("Aggregation type 'agg' must be provided")
+ if not isinstance(agg, str) or not agg:
+ raise ValueError("Aggregation type 'agg' must be provided and must be a non-empty string")
from collections.abc import Callable

import pandas as pd

pd.set_option("compute.use_numba", True)
import numpy as np
import polars as pl
from loguru import logger
from scipy.sparse import coo_array, csr_array, sparray, vstack

from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
Reorganize module-level imports to the top of the file.
+ import numpy as np
+ import pandas as pd
+ import polars as pl
+ from collections.abc import Callable
+ from loguru import logger
+ from scipy.sparse import coo_array, csr_array, sparray, vstack
+ from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
+ from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
- from collections.abc import Callable
- import pandas as pd
- pd.set_option("compute.use_numba", True)
- import numpy as np
- import polars as pl
- from loguru import logger
- from scipy.sparse import coo_array, csr_array, sparray, vstack
- from MEDS_tabular_automl.generate_ts_features import get_feature_names, get_flat_ts_rep
- from MEDS_tabular_automl.utils import CODE_AGGREGATIONS, VALUE_AGGREGATIONS, load_tqdm
Committable suggestion was skipped due to low confidence.
Tools
Ruff
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
9-9: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- scripts/summarize_over_windows.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- scripts/summarize_over_windows.py
…xed bugs related to event ids and column ids being incorrect.
…toML into esgpt_caching
Actionable comments posted: 6
Outside diff range and nitpick comments (2)
hf_cohort/aces_task_extraction.py (1)
Lines 36-76: Consider enhancing logging for better traceability of the data processing workflow. Adding more detailed logging statements can help in debugging and understanding the flow of data, especially in a complex data processing pipeline like this one.
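For instance (purely illustrative; the function and variable names below are placeholders rather than the script's actual code):

```python
# Hypothetical sketch: log sizes and timings at each stage of a data-processing step.
from datetime import datetime

import polars as pl
from loguru import logger

def load_shard(shard_fp: str) -> pl.DataFrame:
    start = datetime.now()
    logger.info(f"Loading shard from {shard_fp}")
    df = pl.read_parquet(shard_fp)
    logger.info(f"Loaded {df.shape[0]} rows x {df.shape[1]} cols in {datetime.now() - start}")
    return df
```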
scripts/xgboost_sweep.py (1)
Lines 276-284: Clarify the choice between in-memory and iterator-based data structures in the `_build` method. Consider adding comments to explain why one might choose to build data structures in memory versus using iterators, as this could help future maintainers understand the design decisions.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (8)
- configs/xgboost_sweep.yaml (1 hunks)
- hf_cohort/aces_task_extraction.py (1 hunks)
- scripts/tabularize_static.py (1 hunks)
- scripts/xgboost_sweep.py (1 hunks)
- src/MEDS_tabular_automl/file_name.py (1 hunks)
- src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
- src/MEDS_tabular_automl/utils.py (1 hunks)
- tests/test_tabularize.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
- configs/xgboost_sweep.yaml
- scripts/tabularize_static.py
- src/MEDS_tabular_automl/file_name.py
- src/MEDS_tabular_automl/generate_static_features.py
- src/MEDS_tabular_automl/utils.py
Additional context used
Ruff
tests/test_tabularize.py
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
15-20: Module level import not at top of file
21-21: Module level import not at top of file
22-22: Module level import not at top of file
23-23: Module level import not at top of file
24-24: Module level import not at top of file
Additional comments not posted (4)
hf_cohort/aces_task_extraction.py (1)
Lines 13-20: LGTM! The function `get_events_df` is well-implemented with clear logic for filtering and processing data frames.
scripts/xgboost_sweep.py (3)
Lines 21-48: Refactor the `Iterator` class to improve readability and maintainability. The `Iterator` class is quite large and handles multiple responsibilities. Consider breaking it down into smaller classes or functions, each handling a specific part of the data processing, such as loading data, filtering, and iteration management.
Lines 270-271: Add error handling and validation checks in the `train` and `evaluate` methods.
+ # TODO: add in eval, early stopping, etc.
+ # TODO: check for NaN and inf in labels!
These TODOs are crucial for robust model training. Implementing early stopping, evaluation during training, and checks for NaN or infinite values in labels will help prevent overfitting and ensure the model's reliability.
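A self-contained sketch of the kind of label check the second TODO refers to (assuming labels arrive as a NumPy array; the integration point into `train` is not shown):

```python
# Hypothetical sketch: fail fast if labels contain NaN or infinite values before training.
import numpy as np

def validate_labels(labels: np.ndarray) -> None:
    if not np.isfinite(labels).all():
        n_bad = int(np.count_nonzero(~np.isfinite(labels)))
        raise ValueError(f"Found {n_bad} NaN/inf label values; clean the labels before training.")
```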
Lines 319-346: Ensure consistent logging and error handling in the `xgboost` function. Consider adding more detailed logging at each step of the model's configuration, training, and evaluation to help trace the execution flow and identify potential issues. Also, include error handling to manage exceptions that may occur during the model's operation.
def get_unique_time_events_df(events_df: pl.DataFrame):
    """Updates Events DataFrame to have unique timestamps and sorted by patient_id and timestamp."""
    assert events_df.select(pl.col("timestamp")).null_count().collect().item() == 0
    # Check events_df is sorted - so it aligns with the ts_matrix we generate later in the pipeline
    events_df = (
        events_df.drop_nulls("timestamp")
        .select(pl.col(["patient_id", "timestamp"]))
        .unique(maintain_order=True)
    )
    assert events_df.sort(by=["patient_id", "timestamp"]).collect().equals(events_df.collect())
    return events_df
Consider replacing assertions with explicit error handling for production environments.
Assertions are generally disabled in production Python environments (using the `-O` flag), which could lead to unhandled errors. Consider using conditional checks and raising exceptions instead.
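For example, the sortedness assertion in the quoted function could become an explicit check (a sketch reusing the same Polars calls; the helper name is hypothetical):

```python
# Hypothetical sketch: raise instead of assert so the check survives `python -O`.
import polars as pl

def check_sorted(events_df: pl.LazyFrame) -> None:
    collected = events_df.collect()
    if not collected.equals(collected.sort(by=["patient_id", "timestamp"])):
        raise ValueError("events_df must be sorted by (patient_id, timestamp) to align with the ts matrix.")
```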
tests/test_tabularize.py
Outdated
# Store MEDS outputs
for split, data in MEDS_OUTPUTS.items():
    file_path = MEDS_cohort_dir / "final_cohort" / f"{split}.parquet"
    file_path.parent.mkdir(exist_ok=True)
    df = pl.read_csv(StringIO(data))
    df.with_columns(pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S.%f")).write_parquet(
        file_path
    )

# Check the files are not empty
meds_files = f_name_resolver.list_meds_files()
assert len(meds_files) == 4, "MEDS Data Files Should be 4!"
for f in meds_files:
    assert pl.read_parquet(f).shape[0] > 0, "MEDS Data Tabular Dataframe Should not be Empty!"

split_json = json.load(StringIO(SPLITS_JSON))
splits_fp = MEDS_cohort_dir / "splits.json"
json.dump(split_json, splits_fp.open("w"))
logger.info("caching flat representation of MEDS data")
store_columns(cfg)
assert (tabularized_data_dir / "config.yaml").is_file()
assert (tabularized_data_dir / "feature_columns.json").is_file()
assert (tabularized_data_dir / "feature_freqs.json").is_file()

feature_columns = json.load(open(f_name_resolver.get_feature_columns_fp()))
assert get_feature_names("code/count", feature_columns) == sorted(CODE_COLS)
assert get_feature_names("static/present", feature_columns) == sorted(STATIC_PRESENT_COLS)
assert get_feature_names("static/first", feature_columns) == sorted(STATIC_FIRST_COLS)
for value_agg in VALUE_AGGREGATIONS:
    assert get_feature_names(value_agg, feature_columns) == sorted(VALUE_COLS)

# Check Static File Generation
tabularize_static_data(cfg)
actual_files = [str(Path(*f.parts[-5:])) for f in f_name_resolver.list_static_files()]
assert set(actual_files) == set(EXPECTED_STATIC_FILES)
# Check the files are not empty
for f in f_name_resolver.list_static_files():
    static_matrix = load_matrix(f)
    assert static_matrix.shape[0] > 0, "Static Data Tabular Dataframe Should not be Empty!"
    expected_num_cols = len(get_feature_names(f"static/{f.stem}", feature_columns))
    logger.info((static_matrix.shape[1], expected_num_cols))
    logger.info(f_name_resolver.list_static_files())
    assert static_matrix.shape[1] == expected_num_cols, (
        f"Static Data Tabular Dataframe Should have {expected_num_cols}"
        f"Columns but has {static_matrix.shape[1]}!"
    )
static_first_fp = f_name_resolver.get_flat_static_rep("tuning", "0", "static/first")
static_present_fp = f_name_resolver.get_flat_static_rep("tuning", "0", "static/present")
assert (
    load_matrix(static_first_fp).shape[0] == load_matrix(static_present_fp).shape[0]
), "static data first and present aggregations have different numbers of rows"

summarize_ts_data_over_windows(cfg)
# confirm summary files exist:
output_files = f_name_resolver.list_ts_files()
f_name_resolver.list_ts_files()
actual_files = [str(Path(*f.parts[-5:])) for f in output_files]

assert set(actual_files) == set(SUMMARIZE_EXPECTED_FILES)
for f in output_files:
    sparse_array = load_matrix(f)
    assert sparse_array.shape[0] > 0
    assert sparse_array.shape[1] > 0
ts_code_fp = f_name_resolver.get_flat_ts_rep("tuning", "0", "365d", "code/count")
ts_value_fp = f_name_resolver.get_flat_ts_rep("tuning", "0", "365d", "value/sum")
assert (
    load_matrix(ts_code_fp).shape[0] == load_matrix(ts_value_fp).shape[0]
), "time series code and value have different numbers of rows"
assert (
    load_matrix(static_first_fp).shape[0] == load_matrix(ts_value_fp).shape[0]
), "static data and time series have different numbers of rows"

# Create fake labels
for f in f_name_resolver.list_meds_files():
    df = pl.read_parquet(f)
    df = get_events_df(df, feature_columns)
    pseudo_labels = pl.Series(([0, 1] * df.shape[0])[: df.shape[0]])
    df = df.with_columns(pl.Series(name="label", values=pseudo_labels))
    df = df.select(pl.col(["patient_id", "timestamp", "label"]))
    df = df.unique(subset=["patient_id", "timestamp"])
    df = df.with_row_index("event_id")

    split = f.parent.stem
    shard_num = f.stem
    out_f = f_name_resolver.get_label(split, shard_num)
    out_f.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(out_f)
model_dir = Path(d) / "save_model"
xgboost_config_kwargs = {
    "model_dir": str(model_dir.resolve()),
    "hydra.mode": "MULTIRUN",
}
xgboost_config_kwargs = {**tabularize_config_kwargs, **xgboost_config_kwargs}
with initialize(version_base=None, config_path="../configs/"):  # path to config.yaml
    overrides = [f"{k}={v}" for k, v in xgboost_config_kwargs.items()]
    cfg = compose(config_name="xgboost_sweep", overrides=overrides)  # config.yaml
    xgboost(cfg)
output_files = list(model_dir.glob("*/*/*_model.json"))
assert len(output_files) == 1
Refactor `test_tabularize` into smaller, more focused tests.
This function covers multiple aspects of the data processing workflow, including setup, execution, and verification. Splitting this into smaller tests would improve maintainability and make it easier to identify issues with specific parts of the data processing pipeline.
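One way to split the monolithic test, sketched with pytest fixtures (the fixture contents are placeholders for the setup already performed in `test_tabularize`):

```python
# Hypothetical sketch: share expensive setup via a fixture and test each stage separately.
import pytest


@pytest.fixture
def tabularize_cfg(tmp_path):
    # Build the Hydra config and write the synthetic MEDS shards into tmp_path here.
    ...


def test_store_columns(tabularize_cfg):
    # Exercises only scripts.identify_columns.store_columns and its outputs.
    ...


def test_tabularize_static(tabularize_cfg):
    # Exercises only scripts.tabularize_static.tabularize_static_data.
    ...


def test_summarize_over_windows(tabularize_cfg):
    # Exercises only scripts.summarize_over_windows.summarize_ts_data_over_windows.
    ...
```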
import rootutils

root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

import json
import tempfile
from io import StringIO
from pathlib import Path

import polars as pl
from hydra import compose, initialize
from loguru import logger

from MEDS_tabular_automl.file_name import FileNameResolver
from MEDS_tabular_automl.utils import (
    VALUE_AGGREGATIONS,
    get_events_df,
    get_feature_names,
    load_matrix,
)
from scripts.identify_columns import store_columns
from scripts.summarize_over_windows import summarize_ts_data_over_windows
from scripts.tabularize_static import tabularize_static_data
from scripts.xgboost_sweep import xgboost
Reorganize import statements to the top of the file to follow Python's best practices.
+ import json
+ import tempfile
+ from io import StringIO
+ from pathlib import Path
+ import polars as pl
+ from hydra import compose, initialize
+ from loguru import logger
+ from MEDS_tabular_automl.file_name import FileNameResolver
+ from MEDS_tabular_automl.utils import VALUE_AGGREGATIONS, get_feature_names, load_matrix
+ from scripts.identify_columns import store_columns
+ from scripts.summarize_over_windows import summarize_ts_data_over_windows
+ from scripts.tabularize_static import tabularize_static_data
- import rootutils
Tools
Ruff
5-5: Module level import not at top of file
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
10-10: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
14-14: Module level import not at top of file
15-20: Module level import not at top of file
21-21: Module level import not at top of file
22-22: Module level import not at top of file
23-23: Module level import not at top of file
24-24: Module level import not at top of file
@hydra.main(version_base=None, config_path="../configs", config_name="xgboost_sweep")
def xgboost(cfg: DictConfig) -> float:
    """Optimize the model based on the provided configuration.

    Args:
    - cfg (DictConfig): Configuration dictionary.

    Returns:
    - float: Evaluation result.
    """
    logger.debug("Initializing XGBoost model")
    model = XGBoostModel(cfg)
    logger.debug("Training XGBoost model")
    time = datetime.now()
    model.train()
    logger.debug(f"Training took {datetime.now() - time}")
    # save model
    save_dir = (
        Path(cfg.model_dir)
        / "_".join(map(str, cfg.window_sizes))
        / "_".join([agg.replace("/", "") for agg in cfg.aggs])
    )
    save_dir.mkdir(parents=True, exist_ok=True)

    model.model.save_model(save_dir / f"{np.random.randint(100000, 999999)}_model.json")

    return model.evaluate()
Enhance the `xgboost` function with detailed logging and error handling.
Consider adding more detailed logging at each step of the model's configuration, training, and evaluation to help trace the execution flow and identify potential issues. Also, include error handling to manage exceptions that may occur during the model's operation.
def evaluate(self) -> float:
    """Evaluate the model on the test set.

    Returns:
    - float: Evaluation metric (mae).
    """
    # TODO: Figure out exactly what we want to do here

    y_pred = self.model.predict(self.dtest)
    y_true = self.dtest.get_label()
    return mean_absolute_error(y_true, y_pred)
Improve robustness of the `evaluate` method with error handling and detailed logging.
Consider adding try-except blocks to handle potential exceptions during model evaluation. Additionally, include detailed logging to trace the computation of the evaluation metric.
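A hedged sketch of the suggested hardening (names mirror the quoted method; the logging and error-handling policy are assumptions):

```python
# Hypothetical sketch: wrap evaluation in error handling and log the resulting metric.
from loguru import logger
from sklearn.metrics import mean_absolute_error

def evaluate(self) -> float:
    try:
        y_pred = self.model.predict(self.dtest)
        y_true = self.dtest.get_label()
    except Exception as e:
        logger.error(f"Evaluation failed while generating predictions: {e}")
        raise
    mae = mean_absolute_error(y_true, y_pred)
    logger.info(f"Held-out MAE: {mae:.4f}")
    return mae
```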
Actionable comments posted: 1
Outside diff range and nitpick comments (1)
src/MEDS_tabular_automl/generate_summarized_reps.py (1)
Lines 83-109: Consider enhancing the documentation to clarify the operations performed during the rolling window aggregation.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/MEDS_tabular_automl/generate_summarized_reps.py (1 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/generate_summarized_reps.py
6-6: Module level import not at top of file
7-7: Module level import not at top of file
8-8: Module level import not at top of file
9-9: Module level import not at top of file
11-11: Module level import not at top of file
12-12: Module level import not at top of file
Additional comments not posted (8)
src/MEDS_tabular_automl/generate_summarized_reps.py (8)
Lines 15-25: LGTM! The function correctly handles the case where `agg` is `None` and appropriately transforms column names.
Lines 28-41: LGTM! The function handles various aggregation methods and maintains type consistency where needed.
Lines 112-124: LGTM! The function correctly calculates indices for rolling windows and handles special cases appropriately.
Lines 127-160: LGTM! The function efficiently aggregates the matrix based on window indices and optionally displays progress.
Lines 163-227: LGTM! The function correctly applies aggregation to the dataframe and includes comprehensive documentation with examples.
Lines 230-278: LGTM! The function validates the aggregation type and processes the data frame to generate a summary.
Lines 281-349: LGTM! The function processes the data frame to apply specified aggregations and ensures all required feature columns are present in the final output.
Lines 351-375: LGTM! The main block correctly sets up the environment and processes the data using the defined functions.
def sum_merge_timestamps(df, sparse_matrix, agg):
    """Groups by timestamp and combines rows that are on the same date.

    The combining is done by summing the rows in the sparse matrix that correspond to the same date.

    Args:
        df (DataFrame): The DataFrame with 'timestamp' and 'patient_id'.
        sparse_matrix (csr_matrix): The corresponding sparse matrix with data.
        agg (str): Aggregation method, currently only 'sum' is implemented.

    Returns:
        DataFrame, csr_matrix: Tuple containing the DataFrame with aggregated timestamps and the
            corresponding sparse matrix.
    """
    # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
    # df = df.sort_values(by='timestamp')

    # Group by timestamp and sum the data
    grouped = df.groupby("timestamp")
    indices = grouped.indices

    # Create a new sparse matrix with summed rows per unique timestamp
    patient_id = df["patient_id"].iloc[0]
    timestamps = []
    output_matrix = csr_array((0, sparse_matrix.shape[1]), dtype=sparse_matrix.dtype)

    # Loop through each group and sum
    for timestamp, rows in indices.items():
        # Combine the rows in the sparse matrix for the current group (respecting the aggregation being used)
        merged_matrix = sparse_aggregate(sparse_matrix[rows], agg)
        # Save the non-zero elements
        output_matrix = vstack([output_matrix, merged_matrix])
        timestamps.extend([timestamp])

    # Create output DataFrame
    out_df = pd.DataFrame({"patient_id": [patient_id] * len(timestamps), "timestamp": timestamps})
    return out_df, output_matrix
Consider ensuring timestamps are sorted within the function to avoid potential issues if the assumption is not met externally.
- # Assuming 'timestamp' is already sorted; if not, uncomment the next line:
+ df = df.sort_values(by='timestamp')
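To make the docstring's contract concrete, here is a small usage sketch; it assumes the package from this PR is installed so the function and its sparse_aggregate helper resolve, and that the "sum" aggregation collapses same-day rows as described above:

```python
import pandas as pd
from scipy.sparse import csr_array

from MEDS_tabular_automl.generate_summarized_reps import sum_merge_timestamps

# Three events for one patient: two on the same day, one the next day.
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 1],
        "timestamp": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    }
)
mat = csr_array([[1, 0], [0, 2], [3, 0]])

out_df, out_mat = sum_merge_timestamps(df, mat, agg="sum")
print(out_df)             # two rows: 2020-01-01 and 2020-01-02 for patient 1
print(out_mat.toarray())  # expected [[1, 2], [3, 0]] if same-day rows are summed
```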
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- hf_cohort/aces_task_extraction.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- hf_cohort/aces_task_extraction.py
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- hf_cohort/hf_cohort_e2e.sh (1 hunks)
Additional context used
Shellcheck
hf_cohort/hf_cohort_e2e.sh
[warning] 4-4: N_RUNS appears unused. Verify use (or export if used externally).
[warning] 5-5: OUTPUT_BASE appears unused. Verify use (or export if used externally).
[warning] 29-29: POLARS_MAX_THREADS appears unused. Verify use (or export if used externally).
[info] 52-52: Double quote to prevent globbing and word splitting.
Additional comments not posted (1)
hf_cohort/hf_cohort_e2e.sh (1)
- 10-10: Ensure N_PARALLEL_WORKERS is properly validated. It's important to validate the input for N_PARALLEL_WORKERS to ensure it's a positive integer. This prevents potential runtime errors or unexpected behaviors.

Verification successful
The validation for N_PARALLEL_WORKERS is correctly implemented and ensures it is a positive integer.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that N_PARALLEL_WORKERS is a positive integer.
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
    echo "Error: N_PARALLEL_WORKERS is not a positive integer."
    exit 1
fi

Length of output: 146
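If the same guard were wanted inside hf_cohort_e2e.sh itself (rather than only in the ad-hoc verification above), a sketch might look like the following; reading the value from $1 is an assumption about how the script receives it:

```bash
#!/bin/bash
# Hypothetical guard near the top of hf_cohort_e2e.sh.
N_PARALLEL_WORKERS="$1"
# ^[1-9][0-9]*$ rejects empty strings, non-digits, and zero.
if ! [[ "$N_PARALLEL_WORKERS" =~ ^[1-9][0-9]*$ ]]; then
    echo "Error: N_PARALLEL_WORKERS must be a positive integer, got '$N_PARALLEL_WORKERS'." >&2
    exit 1
fi
```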
hf_cohort/hf_cohort_e2e.sh
Outdated
echo "build_dataset.sh failed with status $cmd_exit_status." | ||
echo "Stderr from build_dataset.sh (see $LOG_DIR/cmd.stderr):" | ||
tail $LOG_DIR/cmd.stderr | ||
exit $cmd_exit_status |
Correct the shell syntax for robustness.
The exit command should be properly quoted to prevent globbing and word splitting, ensuring the script's robustness and security.
- exit $cmd_exit_status
+ exit "$cmd_exit_status"
Tools
Shellcheck
[info] 52-52: Double quote to prevent globbing and word splitting.
echo "Running identify_columns.py: Caching feature names and frequencies." | ||
rm -rf $OUTPUT_DIR | ||
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \ | ||
MEDS_cohort_dir=$MEDS_DIR \ | ||
tabularized_data_dir=$OUTPUT_DIR \ | ||
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS" |
Consider adding error handling for the rm and python commands.
The script lacks error handling for critical operations like removing directories and running Python scripts. Adding error checks will improve the robustness of the script.
+ if ! rm -rf $OUTPUT_DIR; then
+ echo "Failed to remove $OUTPUT_DIR"
+ exit 1
+ fi
+ if ! POLARS_MAX_THREADS=32 python scripts/identify_columns.py \
+ MEDS_cohort_dir=$MEDS_DIR \
+ tabularized_data_dir=$OUTPUT_DIR \
+ min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"; then
+ echo "Python script failed"
+ exit 1
+ fi
echo "Running identify_columns.py: Caching feature names and frequencies." | |
rm -rf $OUTPUT_DIR | |
POLARS_MAX_THREADS=32 python scripts/identify_columns.py \ | |
MEDS_cohort_dir=$MEDS_DIR \ | |
tabularized_data_dir=$OUTPUT_DIR \ | |
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS" | |
echo "Running identify_columns.py: Caching feature names and frequencies." | |
if ! rm -rf $OUTPUT_DIR; then | |
echo "Failed to remove $OUTPUT_DIR" | |
exit 1 | |
fi | |
if ! POLARS_MAX_THREADS=32 python scripts/identify_columns.py \ | |
MEDS_cohort_dir=$MEDS_DIR \ | |
tabularized_data_dir=$OUTPUT_DIR \ | |
min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"; then | |
echo "Python script failed" | |
exit 1 | |
fi |
WINDOW_SIZES="window_sizes=[1d,7d,30d,365d,full]"
AGGS="aggs=[static/present,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"
Tip
Codebase Verification
The variables WINDOW_SIZES and AGGS are defined in the script hf_cohort/hf_cohort_e2e.sh, but they appear to be empty when the script is executed. This suggests that the variables might not be properly exported or sourced before the format check.
- Ensure that WINDOW_SIZES and AGGS are correctly set and exported before the format check in the script.
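A minimal sketch of the fix the tip suggests: set and export the two variables before anything that inspects them, so the format check sees non-empty values. The values mirror those quoted from the script; whether export is strictly needed depends on how the check is invoked:

```bash
#!/bin/bash
export WINDOW_SIZES="window_sizes=[1d,7d,30d,365d,full]"
export AGGS="aggs=[static/present,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]"

# Fail fast if either variable is malformed or empty.
[[ "$WINDOW_SIZES" =~ ^window_sizes=\[.*\]$ ]] || { echo "Bad WINDOW_SIZES" >&2; exit 1; }
[[ "$AGGS" =~ ^aggs=\[.*\]$ ]] || { echo "Bad AGGS" >&2; exit 1; }
```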
Analysis chain
Review the aggregation and window size parameters.
The parameters for window sizes and aggregations are critical for the script's operation. Ensure these are correctly formatted and match the expected patterns used in downstream processing.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the format of WINDOW_SIZES and AGGS.
if ! [[ "$WINDOW_SIZES" =~ ^window_sizes=\[.*\]$ ]] ; then
echo "Error: WINDOW_SIZES format is incorrect."
exit 1
fi
if ! [[ "$AGGS" =~ ^aggs=\[.*\]$ ]] ; then
echo "Error: AGGS format is incorrect."
exit 1
fi
Length of output: 134
Script:
#!/bin/bash
# Description: Print the values of WINDOW_SIZES and AGGS to verify their correctness.
echo "WINDOW_SIZES: $WINDOW_SIZES"
echo "AGGS: $AGGS"
Length of output: 61
Script:
#!/bin/bash
# Description: Search for the assignments of WINDOW_SIZES and AGGS in the codebase.
rg 'WINDOW_SIZES='
rg 'AGGS='
Length of output: 560
hf_cohort/hf_cohort_e2e.sh
Outdated
    tabularized_data_dir=$OUTPUT_DIR \
    min_code_inclusion_frequency=1 "$WINDOW_SIZES" do_overwrite=False "$AGGS"

POLARS_MAX_THREADS=1
Tip
Codebase Verification
The variable POLARS_MAX_THREADS is set to 1 in hf_cohort/hf_cohort_e2e.sh but is not used afterward in the script. It is used with a value of 32 in other parts of the same script and in hf_cohort/hf_cohort_shard.sh.
- hf_cohort/hf_cohort_e2e.sh: POLARS_MAX_THREADS=32 is used with Python scripts.
- hf_cohort/hf_cohort_shard.sh: POLARS_MAX_THREADS=32 is used with Python scripts.

However, the specific instance of POLARS_MAX_THREADS=1 does not appear to be utilized.
- hf_cohort/hf_cohort_e2e.sh:29: POLARS_MAX_THREADS=1 is set but not used.
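If line 29 was meant to throttle a specific step to one thread, the assignment has to be attached to that command or exported, as in this sketch; scripts/some_step.py is a placeholder, not an actual script name from the PR:

```bash
#!/bin/bash
# Option 1: scope the limit to the one command it should constrain,
# mirroring how the script already prefixes other steps with POLARS_MAX_THREADS=32.
POLARS_MAX_THREADS=1 python scripts/some_step.py MEDS_cohort_dir="$MEDS_DIR"

# Option 2: export it so every later python step inherits the limit.
export POLARS_MAX_THREADS=1
```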
Analysis chain
Verify the usage of POLARS_MAX_THREADS.
The variable POLARS_MAX_THREADS is set to 1 without subsequent usage, which might be an oversight or error. Confirm its intended use.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if POLARS_MAX_THREADS is used after being set.
grep -q 'POLARS_MAX_THREADS' hf_cohort/hf_cohort_e2e.sh
if [ $? -ne 0 ]; then
echo "POLARS_MAX_THREADS set but not used."
fi
Length of output: 73
Script:
#!/bin/bash
# Description: Check if POLARS_MAX_THREADS is used anywhere in the repository after being set.
grep -r 'POLARS_MAX_THREADS' .
Length of output: 905
Tools
Shellcheck
[warning] 29-29: POLARS_MAX_THREADS appears unused. Verify use (or export if used externally).
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- .github/workflows/tests.yaml (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- .github/workflows/tests.yaml
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- .github/workflows/tests.yaml (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- .github/workflows/tests.yaml
…perties, but don't know what the desired behavior is for the error
Some fixes for pre-commit, testing, etc. NOT READY FOR MERGING YET
Pull Request Summary
This pull request details the modifications and enhancements made to the data processing and aggregation pipeline. Below is the list of tasks that have been implemented and those that are still in progress.
Completed Tasks
- get_flat_rep_feature_cols: This function processes aggregations and selects which columns receive which aggregations. The process is categorized into two main types:
- get_flat_static_rep: Produce static representations of data which do not change over time or with conditions.

Tasks in Progress
- get_flat_ts_rep: This function is intended to iterate through each shard to produce a raw representation of time series data.

Summary by CodeRabbit
New Features
Chores