Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added docstrings for scripts/generate_permutations.py #23

Merged

Conversation

aleksiakolo
Copy link
Collaborator

@aleksiakolo aleksiakolo commented Jun 13, 2024

Summary by CodeRabbit

  • New Features

    • Introduced functions for converting data to sparse matrices and merging static and time-series data.
    • Added a new function to generate a DataFrame with code frequencies under different aggregations.
  • Refactor

    • Refactored functions to improve clarity and future-proofing.
    • Integrated logic directly into functions for better efficiency.
  • Tests

    • Reactivated previously skipped integration and tabularize tests.
  • Documentation

    • Enhanced docstrings and examples for better understanding and usability.

@aleksiakolo aleksiakolo requested a review from mmcdermott June 13, 2024 02:27
Copy link
Contributor

coderabbitai bot commented Jun 13, 2024

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The updates across various files in the src/MEDS_tabular_automl module include significant refactoring and improvements. Key changes involve renaming functions, adding type annotations, refining logic for better functionality, and enhancing documentation. Additionally, some test functions were uncommented to ensure comprehensive testing. These modifications aim to improve code clarity, maintainability, and functionality, especially in data processing and feature generation workflows.

Changes

Files/Directories Change Summary
src/MEDS_tabular_automl/describe_codes.py Replaced DictConfig with OmegaConf, renamed and updated functions, added new examples and TODOs for clarity.
src/MEDS_tabular_automl/generate_static_features.py Added new functions, modified existing ones, and adjusted parameters to handle sparse matrices and merge static/time-series data.
src/MEDS_tabular_automl/generate_summarized_reps.py Updated function signatures with type annotations, improved docstrings, and refined logic for aggregating and summarizing data.
src/MEDS_tabular_automl/scripts/describe_codes.py Refactored to integrate compute_fn logic directly into write_fn by calling compute_feature_frequencies.
src/MEDS_tabular_automl/scripts/generate_permutations.py Added type annotations, updated docstrings for clarity, and refined function examples.
tests/test_integration.py Uncommented the test_integration function by removing the return statement that was skipping the test.
tests/test_tabularize.py Uncommented the test_tabularize function to ensure it runs during testing.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script as describe_codes.py
    participant Function as compute_feature_frequencies
    participant DataFrame as DataFrame

    User->>Script: Execute write_fn
    Script->>Function: Call compute_feature_frequencies
    Function->>DataFrame: Process DataFrame
    DataFrame-->>Function: Return frequencies
    Function-->>Script: Return frequency DataFrame
    Script-->>User: Output results
Loading

Poem

In the code where data flows,
Functions dance and logic grows.
Frequencies now clear and bright,
Sparsity handled with delight.
Tests awaken from their sleep,
In this code, improvements deep.
🌟✨🐇


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@aleksiakolo aleksiakolo changed the base branch from main to improve_docstrings_and_tests June 13, 2024 02:29
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.85%. Comparing base (4ce0238) to head (6d12091).

Additional details and impacted files
@@                      Coverage Diff                      @@
##           improve_docstrings_and_tests      #23   +/-   ##
=============================================================
  Coverage                         86.85%   86.85%           
=============================================================
  Files                                13       13           
  Lines                               898      898           
=============================================================
  Hits                                780      780           
  Misses                              118      118           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range and nitpick comments (2)
tests/test_integration.py (1)

Line range hint 5-33: Ensure module-level imports are at the top of the file to comply with PEP 8 standards.

- import json
- import subprocess
- import tempfile
- from io import StringIO
- from pathlib import Path
- import polars as pl
- from hydra import compose, initialize
- from test_tabularize import (
-     CODE_COLS,
-     EXPECTED_STATIC_FILES,
-     MEDS_OUTPUTS,
-     SPLITS_JSON,
-     STATIC_FIRST_COLS,
-     STATIC_PRESENT_COLS,
-     SUMMARIZE_EXPECTED_FILES,
-     VALUE_COLS,
- )
- from MEDS_tabular_automl.describe_codes import get_feature_columns
- from MEDS_tabular_automl.file_name import list_subdir_files
- from MEDS_tabular_automl.utils import (
-     VALUE_AGGREGATIONS,
-     get_events_df,
-     get_feature_names,
-     get_shard_prefix,
-     get_unique_time_events_df,
-     load_matrix,
- )
+ # All imports moved to the top of the file
tests/test_tabularize.py (1)

Line range hint 5-30: The imports are not at the top of the file, which is a common Python convention to improve readability and prevent potential issues with circular dependencies.

+ import rootutils
+ import json
+ import subprocess
+ import tempfile
+ from io import StringIO
+ from pathlib import Path
+ import polars as pl
+ from hydra import compose, initialize
+ from MEDS_tabular_automl.describe_codes import get_feature_columns
+ from MEDS_tabular_automl.file_name import list_subdir_files
+ from MEDS_tabular_automl.scripts import (
+     cache_task,
+     describe_codes,
+     launch_xgboost,
+     tabularize_static,
+     tabularize_time_series,
+ )
+ from MEDS_tabular_automl.utils import (
+     VALUE_AGGREGATIONS,
+     get_events_df,
+     get_feature_names,
+     get_shard_prefix,
+     get_unique_time_events_df,
+     load_matrix,
+ )
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 670f3bb and 6d12091.

Files selected for processing (7)
  • src/MEDS_tabular_automl/describe_codes.py (9 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (6 hunks)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (6 hunks)
  • src/MEDS_tabular_automl/scripts/describe_codes.py (2 hunks)
  • src/MEDS_tabular_automl/scripts/generate_permutations.py (2 hunks)
  • tests/test_integration.py (1 hunks)
  • tests/test_tabularize.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • src/MEDS_tabular_automl/scripts/generate_permutations.py
Additional context used
Ruff
src/MEDS_tabular_automl/generate_summarized_reps.py

6-6: Module level import not at top of file (E402)


7-7: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


10-10: Module level import not at top of file (E402)


11-16: Module level import not at top of file (E402)

tests/test_integration.py

5-5: Module level import not at top of file (E402)


6-6: Module level import not at top of file (E402)


7-7: Module level import not at top of file (E402)


8-8: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


11-11: Module level import not at top of file (E402)


12-12: Module level import not at top of file (E402)


13-22: Module level import not at top of file (E402)


24-24: Module level import not at top of file (E402)


25-25: Module level import not at top of file (E402)


26-33: Module level import not at top of file (E402)

tests/test_tabularize.py

5-5: Module level import not at top of file (E402)


6-6: Module level import not at top of file (E402)


7-7: Module level import not at top of file (E402)


8-8: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


11-11: Module level import not at top of file (E402)


12-12: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)


15-15: Module level import not at top of file (E402)


16-22: Module level import not at top of file (E402)


23-30: Module level import not at top of file (E402)

Additional comments not posted (13)
src/MEDS_tabular_automl/scripts/describe_codes.py (1)

76-76: Ensure correct function usage after refactoring.

src/MEDS_tabular_automl/generate_static_features.py (3)

28-38: The function convert_to_matrix is well-documented and the type annotations are clear.


170-184: The function get_flat_static_rep is well-documented and the type annotations are clear.


54-66: The function get_sparse_static_rep effectively merges static and time-series data. Ensure the assertions for data integrity are robust.

Verification successful

The function get_sparse_static_rep includes assertions to ensure data integrity:

  • assert static_df.select(pl.col("patient_id")).collect().to_series().is_sorted()
  • assert matrix.shape[1] == len(...)

These assertions help verify that the patient_id column in static_df is sorted and that the shape of the matrix matches the expected length.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the data integrity checks in `get_sparse_static_rep`.

# Test: Search for usage of `assert` in the function. Expect: Robust checks.
rg --type python $'assert' 'src/MEDS_tabular_automl/generate_static_features.py'

Length of output: 108



Script:

#!/bin/bash
# Description: Verify the data integrity checks in `get_sparse_static_rep`.

# Test: Search for usage of `assert` in the function. Expect: Robust checks.
rg 'assert' 'src/MEDS_tabular_automl/generate_static_features.py'

Length of output: 195

src/MEDS_tabular_automl/generate_summarized_reps.py (5)

168-191: The function generate_summary is well-documented. Ensure the aggregation types are correctly validated.

Verification successful

The code does include checks for unsupported aggregation types, raising a ValueError when an invalid aggregation method is encountered.

  • ValueError: If the aggregation method is not implemented.
  • raise ValueError(f"Aggregation method '{agg}' not implemented.")
  • ValueError: If the aggregation type is not supported.
  • raise ValueError(

These lines confirm that the aggregation types are correctly validated within the generate_summary function.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify aggregation type validation in `generate_summary`.

# Test: Search for aggregation type checks. Expect: Correct validation.
rg --type python $'ValueError' 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 112



Script:

#!/bin/bash
# Description: Verify aggregation type validation in `generate_summary`.

# Test: Search for aggregation type checks. Expect: Correct validation.
rg 'ValueError' 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 294


70-87: The function aggregate_matrix is well-documented. Ensure the type compatibility for aggregated matrices.

Verification successful

The function aggregate_matrix includes a type compatibility check by raising a TypeError if the matrix type is invalid. This ensures that the function handles type compatibility as expected.

  • src/MEDS_tabular_automl/generate_summarized_reps.py: raise TypeError(f"Invalid matrix type {type(agg_matrix)}")
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify type compatibility for aggregated matrices in `aggregate_matrix`.

# Test: Search for type checks in the function. Expect: Proper type handling.
rg --type python $'TypeError' 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 111



Script:

#!/bin/bash
# Description: Verify type compatibility for aggregated matrices in `aggregate_matrix`.

# Test: Search for type checks in the function. Expect: Proper type handling.
rg 'TypeError' --glob 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 346


126-149: The function compute_agg is well-documented and handles aggregation using rolling window indices. Ensure the DataFrame operations are optimized.

Verification successful

The compute_agg function does not appear to use DataFrame operations directly. Instead, it operates on sparse matrices and numpy arrays, which are generally efficient for the types of operations being performed (e.g., sum, min, max, power, getnnz, and concatenate). Therefore, the concern about optimizing DataFrame operations does not apply here.

  • The function uses efficient operations on sparse matrices and numpy arrays.
  • No DataFrame operations like groupby, apply, merge, join, or concat were found within the function.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify DataFrame operations are optimized in `compute_agg`.

# Test: Search for DataFrame operations. Expect: Optimized operations.
rg --type python $'group_by' 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 110



Script:

#!/bin/bash
# Description: Verify DataFrame operations are optimized in `compute_agg`.

# Test: Search for DataFrame operations. Expect: Optimized operations.
rg 'groupby|apply|merge|join|concat' src/MEDS_tabular_automl/generate_summarized_reps.py

Length of output: 715


47-56: The function get_rolling_window_indicies effectively computes rolling window indices. Ensure the handling of the "full" window size is appropriate.

Verification successful

The handling of the "full" window size in the get_rolling_window_indicies function appears to be appropriate. When window_size is "full", it sets a large time delta of 150 years, which should cover the entire dataset.

  • if window_size == "full":
    • timedelta = pd.Timedelta(150 * 52, unit="W") # just use 150 years as time delta
  • else:
    • timedelta = pd.Timedelta(window_size)

This logic ensures that the "full" window size is handled correctly by setting an extensive time delta.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the handling of the "full" window size in `get_rolling_window_indicies`.

# Test: Search for usage of "full" in the function. Expect: Appropriate handling.
rg --type python $'"full"' 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 110



Script:

#!/bin/bash
# Description: Verify the handling of the "full" window size in `get_rolling_window_indicies`.

# Test: Search for usage of "full" within the function `get_rolling_window_indicies`. Expect: Appropriate handling.
rg '"full"' --glob 'src/MEDS_tabular_automl/generate_summarized_reps.py'

Length of output: 234



Script:

#!/bin/bash
# Description: Extract and review the relevant lines of code around the occurrences of "full" in `generate_summarized_reps.py`.

# Extract lines around the first occurrence of "full"
rg -A 5 -B 5 '"full"' src/MEDS_tabular_automl/generate_summarized_reps.py

Length of output: 789


19-31: The function sparse_aggregate is well-documented and handles various aggregation methods. Ensure all intended aggregation methods are implemented.

src/MEDS_tabular_automl/describe_codes.py (4)

4-4: The import of OmegaConf has been updated correctly to reflect the changes in the dependency.


38-66: The function convert_to_freq_dict is well-documented and handles edge cases by raising a ValueError if the expected columns are not present. This is a good practice for robust error handling.


208-208: Registering filter_to_codes as a new resolver in OmegaConf is a significant change. Ensure that this new resolver is used appropriately throughout the project.

Verification successful

The new resolver filter_to_codes is used appropriately in the following locations:

  • src/MEDS_tabular_automl/describe_codes.py
  • src/MEDS_tabular_automl/scripts/tabularize_static.py
  • src/MEDS_tabular_automl/configs/tabularization/default.yaml

This confirms that the resolver is integrated and utilized correctly within the project.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the new resolver `filter_to_codes` is used appropriately.

# Test: Search for resolver usage. Expect: Appropriate usage in configuration files or other Python scripts.
rg --type python $'filter_to_codes'

Length of output: 65



Script:

#!/bin/bash
# Description: Verify that the new resolver `filter_to_codes` is used appropriately.

# Test: Search for resolver usage. Expect: Appropriate usage in configuration files or other Python scripts.
rg 'filter_to_codes'

Length of output: 775


69-102: The function compute_feature_frequencies is well-documented and includes a comprehensive example. Ensure that the new functionality aligns with the rest of the module and that all dependent functions are updated accordingly.

Verification successful

The occurrences of compute_feature_frequencies in the codebase are as follows:

  • src/MEDS_tabular_automl/describe_codes.py: The function is defined and used within an example in the docstring.
  • src/MEDS_tabular_automl/scripts/describe_codes.py: The function is imported and used within the rwlock_wrap function call.
  • docs/source/generated/src.MEDS_tabular_automl.describe_codes.rst: The function is listed in the documentation.

Based on these findings, the function compute_feature_frequencies appears to be correctly implemented, imported, and documented. There are no indications of outdated or incorrect usage.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that all dependent functions are updated to work with the new `compute_feature_frequencies` functionality.

# Test: Search for function usage. Expect: Only occurrences with updated functionality.
rg --type python $'compute_feature_frequencies'

Length of output: 77



Script:

#!/bin/bash
# Description: Verify that all dependent functions are updated to work with the new `compute_feature_frequencies` functionality.

# Test: Search for function usage. Expect: Only occurrences with updated functionality.
rg 'compute_feature_frequencies'

Length of output: 538



Script:

#!/bin/bash
# Description: Extract relevant lines for detailed inspection to verify the correct usage and updates of `compute_feature_frequencies`.

# Extract lines from `src/MEDS_tabular_automl/describe_codes.py`
echo "Occurrences in src/MEDS_tabular_automl/describe_codes.py:"
rg --context 5 'compute_feature_frequencies' src/MEDS_tabular_automl/describe_codes.py

# Extract lines from `src/MEDS_tabular_automl/scripts/describe_codes.py`
echo "Occurrences in src/MEDS_tabular_automl/scripts/describe_codes.py:"
rg --context 5 'compute_feature_frequencies' src/MEDS_tabular_automl/scripts/describe_codes.py

# Extract lines from `docs/source/generated/src.MEDS_tabular_automl.describe_codes.rst`
echo "Occurrences in docs/source/generated/src.MEDS_tabular_automl.describe_codes.rst:"
rg --context 5 'compute_feature_frequencies' docs/source/generated/src.MEDS_tabular_automl.describe_codes.rst

Length of output: 2582

"""
permutations = []
for i in range(1, len(list_of_options) + 1):
permutations.extend(list(combinations(list_of_options, r=i)))
format_print(sorted(permutations))


def main():
"""Generate all possible permutations of a list of options."""
def main() -> None:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See prior comment re -> None

@aleksiakolo aleksiakolo merged commit 43f4a3d into improve_docstrings_and_tests Jun 13, 2024
1 check passed
@mmcdermott mmcdermott deleted the scripts/generate_permutations branch August 10, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants