
Test cases and (eventually) fixes for #114 #117

Open · wants to merge 24 commits into base: dev

Conversation

mmcdermott (Owner) commented:

STILL IN PROGRESS. We just have failing test cases for now.


coderabbitai bot commented Jun 22, 2024

Walkthrough

The updates primarily focus on ensuring consistent handling of subject IDs across the codebase by aligning data types and refining data processing. Additionally, enhancements to the caching mechanisms and shard creation process for deep learning representations have been implemented. A new script for building flat representation datasets has been added, and various test updates ensure compatibility and verify output formats.
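As an illustration of the dtype-alignment idea described above (a minimal polars sketch with made-up column values, not code from this PR):

import polars as pl

events_df = pl.DataFrame({"subject_id": [1, 2, 2], "event": ["a", "b", "c"]})
subjects_df = pl.DataFrame({"subject_id": ["1", "2", "3"]})

# Align subjects_df's subject_id dtype to the events frame's schema before any
# join or filter, so Int64-vs-Utf8 comparisons cannot silently match nothing.
subject_id_dtype = events_df.schema["subject_id"]
subjects_df = subjects_df.with_columns(pl.col("subject_id").cast(subject_id_dtype))
assert subjects_df.schema["subject_id"] == subject_id_dtype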

Changes

• EventStream/baseline/FT_task_baseline.py: Adjusted filtering operations for subject IDs using ESD.subject_id_dtype and modified load_flat_rep function parameters.
• EventStream/data/dataset_polars.py: Updated subject ID handling, recalculated n_events_per_subject, adjusted filters, and added debug logs for data sizes.
• EventStream/data/dataset_base.py: Refined parameter caching, modified deep learning shard creation, and improved processing of subject chunks.
• EventStream/data/pytorch_dataset.py: Updated methods to ensure consistent data types when accessing subject IDs.
• scripts/build_flat_reps.py: Added a new script for building flat representation datasets from a hydra config file, including the necessary imports and function definitions.
• tests/test_e2e_runs.py: Added imports and a new test for dataset outputs; updated the setUp and build_dataset methods to handle Parquet-format outputs and perform assertions.
• tests/data/test_pytorch_dataset.py: Modified the setUp method to convert subject IDs to strings when constructing the shards dictionary.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script as build_flat_reps.py
    participant Dataset as EventStream.data.dataset_polars.Dataset
    participant CacheRep as cache_flat_representation
    
    User->>Script: Execute script with config
    Script->>Dataset: Load dataset
    Dataset->>CacheRep: Resolve cache parameters
    CacheRep->>Dataset: Cache flat representation
    Dataset-->>Script: Processed Dataset
    Script-->>User: Execution complete

This diagram outlines the flow of execution when building a flat representation dataset using the new script build_flat_reps.py. It shows the interactions between the user, script, dataset, and caching process.



coderabbitai bot left a comment

Actionable comments posted: 1

Outside diff range and nitpick comments (4)
tests/test_e2e_runs.py (1)

Line range hint 5-14: Reorder Imports to Comply with PEP 8

The imports are scattered below the setup code, which violates PEP 8's guideline of having all module level imports at the top of the file. This should be corrected to improve readability and maintainability.

+ import json
+ import os
+ import subprocess
+ import unittest
+ from itertools import product
+ from pathlib import Path
+ from tempfile import TemporaryDirectory
+ from typing import Any
+ import polars as pl
import rootutils
root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)
- import json
- import os
- import subprocess
- import unittest
- from itertools import product
- from pathlib import Path
- from tempfile import TemporaryDirectory
- from typing import Any
- import polars as pl
Tools
Ruff
• 5-5, 6-6, 7-7, 8-8: Module level import not at top of file (E402)

EventStream/data/dataset_polars.py (3)

Line range hint 203-203: Use isinstance() for type checking instead of direct type comparison.

Direct comparison of types using == can lead to unexpected behavior and is not considered a best practice. Use isinstance() for checking object types as it is clearer and more reliable.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 1664-1664: Ensure loop variables are correctly scoped in lambda functions.

Using loop variables directly inside lambda functions can lead to unexpected behavior because the lambda may capture the variable by reference, not by value. This can be especially problematic in loops where the variable is bound to change. To fix this, you can use a default argument in the lambda to bind the current value of the loop variable at each iteration.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Also applies to: 1668-1668, 1670-1670, 1673-1673, 1674-1674, 1675-1675, 1682-1682
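For readers unfamiliar with B023, a self-contained illustration of the late-binding pitfall and the default-argument fix (generic Python, not this repository's code):

fns = [lambda: m for m in range(3)]
print([f() for f in fns])  # [2, 2, 2], every closure sees the final value of m

fns = [lambda m=m: m for m in range(3)]
print([f() for f in fns])  # [0, 1, 2], the default binds m's value at definition time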


Line range hint 2029-2029: Optimize dictionary key check.

Use key in dict instead of key in dict.keys() for checking if a key is in a dictionary. The former is more Pythonic and efficient because it does not create a temporary list of keys.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ce9e2c3 and 1b4f0d8.

Files ignored due to path filters (6)
  • sample_data/dataset.yaml is excluded by !**/*.yaml
  • sample_data/dataset_parquet.yaml is excluded by !**/*.yaml
  • sample_data/raw_parquet/admit_vitals.parquet is excluded by !**/*.parquet
  • sample_data/raw_parquet/labs.parquet is excluded by !**/*.parquet
  • sample_data/raw_parquet/medications.parquet is excluded by !**/*.parquet
  • sample_data/raw_parquet/subjects.parquet is excluded by !**/*.parquet
Files selected for processing (2)
  • EventStream/data/dataset_polars.py (3 hunks)
  • tests/test_e2e_runs.py (5 hunks)
Additional context used
Ruff
tests/test_e2e_runs.py
• 5-5, 6-6, 7-7, 8-8, 9-9, 10-10, 11-11, 12-12, 14-14, 16-16: Module level import not at top of file (E402)
EventStream/data/dataset_polars.py
• 203-203: Do not compare types, use isinstance() (E721)
• 1664-1664, 1668-1668, 1670-1670, 1673-1673, 1674-1674, 1675-1675, 1682-1682: Function definition does not bind loop variables m, key_prefix, or val_prefix (B023)
• 2029-2029: Use key in dict instead of key in dict.keys() (SIM118); remove .keys()

Additional comments not posted (2)
tests/test_e2e_runs.py (2)

38-38: Approved addition of Parquet dataset paths

The addition of paths for Parquet datasets in the setUp method aligns with the PR's objective to enhance dataset handling capabilities.


121-131: Approved modifications to handle Parquet datasets in build_dataset method

The updates to the build_dataset method to include handling for both standard and Parquet dataset formats are appropriate and align with the PR's objectives.

coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range and nitpick comments (5)
EventStream/baseline/FT_task_baseline.py (1)

Line range hint 369-369: Consider defining abstract methods in BaseSklearnModuleConfig.

BaseSklearnModuleConfig is an abstract base class without any abstract methods. This might be intentional, but if there are common methods that should be implemented by all subclasses, consider declaring them as abstract.
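What Ruff's B024 is asking for, shown in a generic sketch (the method name here is hypothetical, not taken from the PR):

from abc import ABC, abstractmethod

class BaseConfig(ABC):
    @abstractmethod
    def module_kwargs(self) -> dict:
        """Each concrete config must say how to parameterize its sklearn module."""

class KNNConfig(BaseConfig):
    def module_kwargs(self) -> dict:
        return {"n_neighbors": 5}

print(KNNConfig().module_kwargs())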

EventStream/data/dataset_polars.py (4)

Line range hint 693-715: Update subject event properties with enhanced data handling.

The method _update_subject_event_properties has been updated to handle subject_id data types more robustly, aligning with the changes in the schema. The logic for calculating n_events_per_subject is well-implemented using group_by and agg. However, consider adding error handling or logging for potential exceptions during these operations, especially when casting types or accessing dataframe columns that might not exist.
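One hedged shape the suggested error handling could take (a standalone sketch assuming a subject_id column and stdlib logging; not part of the PR):

import logging

import polars as pl

logger = logging.getLogger(__name__)

def count_events_per_subject(events_df: pl.DataFrame) -> dict[str, int]:
    # Guard against a missing column instead of failing deep inside group_by.
    if "subject_id" not in events_df.schema:
        logger.error("events_df lacks a subject_id column; returning empty counts")
        return {}
    counts = events_df.group_by("subject_id").agg(pl.len().alias("count"))
    # Cast to Utf8 so the keys survive later JSON serialization unchanged.
    counts = counts.with_columns(pl.col("subject_id").cast(pl.Utf8))
    return dict(zip(counts["subject_id"].to_list(), counts["count"].to_list()))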


Line range hint 203-203: Replace type comparison with isinstance().

Direct type comparison using type() is generally discouraged in favor of isinstance(), which supports inheritance and is more flexible.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 1669-1687: Ensure loop variables are correctly bound in lambda expressions.

Several lambda expressions inside loops do not bind the loop variables m, key_prefix, and val_prefix correctly, which might lead to unexpected behavior due to late binding in Python. Consider using additional arguments in the lambda to correctly bind these variables at each iteration.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count")
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count")

Line range hint 2034-2034: Optimize dictionary key existence check.

Using key in dict.keys() is redundant and less efficient than key in dict. Simplifying this can improve both readability and performance.

- if mod_col not in source_df.keys():
+ if mod_col not in source_df:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b4f0d8 and ace03dd.

Files ignored due to path filters (1)
  • configs/dataset_base.yaml is excluded by !**/*.yaml
Files selected for processing (4)
  • EventStream/baseline/FT_task_baseline.py (3 hunks)
  • EventStream/data/dataset_base.py (1 hunks)
  • EventStream/data/dataset_polars.py (7 hunks)
  • tests/test_e2e_runs.py (5 hunks)
Additional context used
Ruff
tests/test_e2e_runs.py
• 5-5, 6-6, 7-7, 8-8, 9-9, 10-10, 11-11, 12-12, 14-14, 16-16: Module level import not at top of file (E402)
EventStream/baseline/FT_task_baseline.py
• 246-246: Do not compare types, use isinstance() (E721)
• 369-369: BaseSklearnModuleConfig is an abstract base class, but it has no abstract methods (B024)
• 397-397: No explicit stacklevel keyword argument found (B028)
• 545-545: Yoda conditions are discouraged, use frac_subjects > 0 instead (SIM300)
EventStream/data/dataset_base.py
• 1488-1488: Loop control variable m not used within loop body (B007); rename unused m to _m
EventStream/data/dataset_polars.py
• 203-203: Do not compare types, use isinstance() (E721)
• 1669-1669, 1673-1673, 1675-1675, 1678-1678, 1679-1679, 1680-1680, 1687-1687: Function definition does not bind loop variables m, key_prefix, or val_prefix (B023)
• 2034-2034: Use key in dict instead of key in dict.keys() (SIM118); remove .keys()

Additional comments not posted (10)
tests/test_e2e_runs.py (3)

Line range hint 5-14: Reorder imports to conform to PEP 8 guidelines.

Imports should be at the top of the file unless there's a specific reason (e.g., avoiding circular dependencies). If the setup requires imports to be delayed, please document the reason to maintain clarity.

+ import json
+ import os
+ import subprocess
+ import unittest
+ from itertools import product
+ from pathlib import Path
+ from tempfile import TemporaryDirectory
+ from typing import Any
+ import polars as pl
import rootutils
root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)
Tools
Ruff
• 5-5, 6-6, 7-7, 8-8: Module level import not at top of file (E402)


121-131: Ensure consistency in dataset building for Parquet format.

The method now handles building datasets from Parquet files. Verify that the configurations and paths used align with the expected dataset structures and that the Parquet dataset is correctly handled in subsequent tests.

Verification successful

Ensure consistency in dataset building for Parquet format.

The usage of dataset_from_parquet appears to be correctly handled in the tests. The path is used in assertions and commands to ensure the dataset is built and verified properly.

  • tests/test_e2e_runs.py:
    • "dataset_from_parquet",
    • f"save_dir={self.paths['dataset_from_parquet']}",
    • self._test_dataset_output((root / "sample_data" / "raw"), self.paths["dataset_from_parquet"])
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the Parquet dataset is correctly handled.

# Test: Search for usage of the "dataset_from_parquet" path in test assertions.
rg --type python $'dataset_from_parquet'

Length of output: 70



Script:

#!/bin/bash
# Description: Verify that the Parquet dataset is correctly handled.

# Test: Search for usage of the "dataset_from_parquet" path in test assertions.
rg --type py $'dataset_from_parquet'

Length of output: 304


38-38: Ensure the new dataset directory is utilized correctly.

The addition of "dataset_from_parquet" suggests handling a new data format. Confirm that this directory is properly integrated into the test workflows.

Verification successful

The new dataset directory is utilized correctly.

The search results show that the "dataset_from_parquet" directory is referenced multiple times in the tests/test_e2e_runs.py file, indicating its integration into the test workflows.

  • tests/test_e2e_runs.py:
    • Line: "dataset_from_parquet",
    • Line: f"save_dir={self.paths['dataset_from_parquet']}"
    • Line: self._test_dataset_output((root / "sample_data" / "raw"), self.paths["dataset_from_parquet"])
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the "dataset_from_parquet" directory is used in relevant test cases.

# Test: Search for usage of the "dataset_from_parquet" directory.
rg --type python $'dataset_from_parquet'

Length of output: 70



Script:

#!/bin/bash
# Description: Verify that the "dataset_from_parquet" directory is used in relevant test cases.

# Test: Search for usage of the "dataset_from_parquet" directory.
rg --type py 'dataset_from_parquet'

Length of output: 304

EventStream/baseline/FT_task_baseline.py (3)

Line range hint 246-246: Use isinstance for type checks.

Replace direct type comparisons with isinstance for better style and reliability.

-        if type(window_sizes) is not list:
+        if not isinstance(window_sizes, list):

Line range hint 397-397: Specify stacklevel in warning calls.

When issuing warnings, it's helpful to specify the stacklevel so that the warning points to the caller's location in the stack, making it easier to locate the source of the issue.

-                warnings.warn(f"Parameter {k} not in signature of {cls.__name__}. Dropping")
+                warnings.warn(f"Parameter {k} not in signature of {cls.__name__}. Dropping", stacklevel=2)
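A generic demonstration of what stacklevel buys (illustrative only, not this codebase):

import warnings

def drop_unknown_param(k):
    # stacklevel=2 attributes the warning to drop_unknown_param's caller,
    # which is the line the user actually needs to change.
    warnings.warn(f"Parameter {k} not in signature. Dropping", stacklevel=2)

def user_code():
    drop_unknown_param("n_estimators")  # the warning report points at this line

user_code()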

Line range hint 545-545: Correct the order in the conditional expression.

It's more readable to place the variable on the left side of the comparison.

-            case float() as frac_subjects if 0 < frac_subjects and frac_subjects < 1:
+            case float() as frac_subjects if frac_subjects > 0 and frac_subjects < 1:
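As an aside, Python's chained comparison sidesteps the ordering question entirely (an alternative to the suggestion above, not something the PR adopts):

frac_subjects = 0.3
if 0 < frac_subjects < 1:  # reads left to right; no Yoda ordering possible
    print("valid fraction")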
EventStream/data/dataset_base.py (4)

1370-1370: Approval of the do_overwrite handling in data caching.

The addition of the do_overwrite condition in the cache_deep_learning_representation method is a good practice. It allows better control over data management by providing an option to avoid overwriting existing files, which can prevent data loss.


1370-1370: Proper management of shard file existence and overwriting logic.

The logic to check the existence of the shard configuration file (shards_fp) and handle the do_overwrite flag is well-implemented. This ensures that the data integrity is maintained by not accidentally overwriting data unless explicitly instructed.


1370-1370: Well-structured method with robust error handling.

The cache_deep_learning_representation method is well-structured and includes comprehensive error handling and logging. This design enhances the robustness and maintainability of the method, making it easier to manage complex data operations and debug issues.


1370-1370: Consider renaming unused loop variable.

The static analysis tool flagged that the variable m in the loop on line 1488 is not used within the loop body. It's a good practice to replace unused loop variables with _ to indicate that they are intentionally unused.

- for m, cfg in self.measurement_configs.items():
+ for _, cfg in self.measurement_configs.items():

Likely invalid or redundant comment.

Comment on lines 731 to 742
logger.debug(
    f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
    f"{col} to {df.schema[col]}"
)
if isinstance(list(incl_targets)[0], str):
    incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
else:
    incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])

incl_list = incl_list.cast(df.schema[col])

logger.debug(f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}")

Enhance filtering logic with dynamic type casting.

The _filter_col_inclusion method introduces dynamic type casting based on the dataframe's schema, which is a robust approach for handling various data types. However, the error handling could be improved by providing more specific error messages related to the failure of type conversion, and possibly logging the values that caused the error.
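A sketch of what the more specific error reporting could look like (the helper name and its exact placement are assumptions, not a committed change):

import polars as pl

def cast_inclusion_targets(incl_targets, col, schema):
    # Cast raw inclusion targets to the dtype of `col`, failing with context
    # that names the column, the target dtype, and the offending values.
    values = list(incl_targets)
    try:
        return pl.Series(values).cast(schema[col])
    except Exception as err:
        raise TypeError(
            f"Could not cast inclusion targets {values!r} to {schema[col]} for column {col!r}"
        ) from err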

Comment on lines 52 to 94
def _test_dataset_output(self, raw_data_root: Path, dataset_save_dir: Path):
    DL_save_dir = dataset_save_dir / "DL_reps"

    train_files = list((DL_save_dir / "train").glob("*.parquet"))
    tuning_files = list((DL_save_dir / "tuning").glob("*.parquet"))
    held_out_files = list((DL_save_dir / "held_out").glob("*.parquet"))

    self.assertTrue(len(train_files) > 0)
    self.assertTrue(len(tuning_files) > 0)
    self.assertTrue(len(held_out_files) > 0)

    train_DL_reps = pl.concat([pl.read_parquet(f, use_pyarrow=False) for f in train_files])
    tuning_DL_reps = pl.concat([pl.read_parquet(f, use_pyarrow=False) for f in tuning_files])
    held_out_DL_reps = pl.concat([pl.read_parquet(f, use_pyarrow=False) for f in held_out_files])

    DL_shards = json.loads((dataset_save_dir / "DL_shards.json").read_text())

    ESD_subjects = pl.read_parquet(dataset_save_dir / "subjects_df.parquet", use_pyarrow=False)

    # Check that the DL shards are correctly partitioned.
    all_subjects = set(ESD_subjects["subject_id"].unique().to_list())

    self.assertEqual(len(all_subjects), len(ESD_subjects))

    all_subj_in_DL_shards = set().union(*DL_shards.values())

    self.assertEqual(all_subjects, all_subj_in_DL_shards)

    train_DL_subjects = set(train_DL_reps["subject_id"].unique().to_list())
    tuning_DL_subjects = set(tuning_DL_reps["subject_id"].unique().to_list())
    held_out_DL_subjects = set(held_out_DL_reps["subject_id"].unique().to_list())

    self.assertEqual(len(train_DL_subjects & tuning_DL_subjects), 0)
    self.assertEqual(len(train_DL_subjects & held_out_DL_subjects), 0)
    self.assertEqual(len(tuning_DL_subjects & held_out_DL_subjects), 0)

    self.assertTrue(len(train_DL_subjects) > len(tuning_DL_subjects))
    self.assertTrue(len(train_DL_subjects) > len(held_out_DL_subjects))

    all_DL_subjects = train_DL_subjects | tuning_DL_subjects | held_out_DL_subjects

    self.assertEqual(all_DL_subjects, all_subjects)


Enhance documentation for _test_dataset_output.

The method checks the integrity of dataset outputs in Parquet format. Adding more detailed comments explaining the checks performed (e.g., why specific sets of subjects are compared) would enhance maintainability and readability.

+        # Check that the DL shards are correctly partitioned.
+        # Ensure no subjects overlap between training, tuning, and held-out datasets.
+        # Verify that the number of subjects in training is greater than in other datasets.

Committable suggestion was skipped due to low confidence.

coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (4)
EventStream/data/dataset_polars.py (4)

Line range hint 203-203: Replace type comparison with isinstance() for Pythonic code style.

Direct type comparisons are not recommended. Use isinstance() for checking object types to ensure compatibility with Python's type system.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 693-715: Consider enhancing the handling of subjects with no events.

The current implementation efficiently identifies subjects without events and assigns them a count of 0. However, this could be further optimized or simplified using direct polars operations, which might improve performance and readability.

-            subjects_with_no_events = set(subjects_df_subjects["subject_id"].to_list()) - self.subject_ids
-            for sid in subjects_with_no_events:
-                self.n_events_per_subject[sid] = 0
-            self.subject_ids.update(subjects_with_no_events)
+            # Assuming `subjects_df_subjects` and `self.subject_ids` are both sets of subject IDs
+            self.n_events_per_subject.update({sid: 0 for sid in subjects_df_subjects - self.subject_ids})
+            self.subject_ids.update(subjects_df_subjects - self.subject_ids)

Line range hint 2033-2033: Use direct dictionary membership test instead of .keys() for efficiency.

Directly checking membership in a dictionary is more efficient and Pythonic than using .keys().

- if mod_col not in source_df.keys():
+ if mod_col not in source_df:

Line range hint 1668-1686: Ensure loop variables are correctly bound in closures.

Function definitions within loops should explicitly bind loop variables to avoid referencing the wrong variable due to late binding.

- key_col = cs.starts_with(key_prefix)
- val_col = cs.starts_with(val_prefix).drop_nans().cast(pl.Float32)
+ key_col = cs.starts_with(key_prefix[:])
+ val_col = cs.starts_with(val_prefix[:]).drop_nans().cast(pl.Float32)
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ace03dd and b41b014.

Files selected for processing (1)
  • EventStream/data/dataset_polars.py (7 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py
• 203-203: Do not compare types, use isinstance() (E721)
• 1668-1668, 1672-1672, 1674-1674, 1677-1677, 1678-1678, 1679-1679, 1686-1686: Function definition does not bind loop variables m, key_prefix, or val_prefix (B023)
• 2033-2033: Use key in dict instead of key in dict.keys() (SIM118); remove .keys()

Additional comments not posted (1)
EventStream/data/dataset_polars.py (1)

731-742: Enhance filtering logic with dynamic type casting.

The _filter_col_inclusion method introduces dynamic type casting based on the dataframe's schema, which is a robust approach for handling various data types. However, the error handling could be improved by providing more specific error messages related to the failure of type conversion, and possibly logging the values that caused the error.

coderabbitai bot left a comment

Actionable comments posted: 7

Outside diff range and nitpick comments (4)
EventStream/data/pytorch_dataset.py (1)

Line range hint 209-209: Avoid shadowing the loop variable.

The variable n_events is used both as a loop control variable and an iterable, which can lead to confusing or erroneous behavior.

-            for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):
+            for i, (subj, num_events) in enumerate(zip(subject_ids, n_events)):
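A generic illustration of B020, separate from the repository code:

subject_ids = ["a", "b", "c"]
n_events = [5, 3, 8]

# Bad: the loop target rebinds the name n_events on the first iteration,
# so later code that expects the original list silently sees an element.
# for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)): ...

# Good: a distinct loop name leaves the iterable intact.
for i, (subj, num_events) in enumerate(zip(subject_ids, n_events)):
    print(i, subj, num_events)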
EventStream/data/dataset_base.py (1)

Line range hint 1485-1485: Rename unused loop variable.

The loop variable m is not used within the loop body. It can be renamed to _ to indicate that it is intentionally unused.

- for m, cfg in self.measurement_configs.items():
+ for _, cfg in self.measurement_configs.items():
EventStream/data/dataset_polars.py (2)

Line range hint 203-203: Use isinstance() for type checking instead of direct type comparison.

Using isinstance() is the recommended way to check types in Python as it supports inheritance.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 2039-2039: Simplify dictionary key existence check.

Use direct key checking in dictionaries without calling .keys().

- if mod_col not in source_df.keys():
+ if mod_col not in source_df:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b41b014 and 9d703b7.

Files selected for processing (4)
  • EventStream/data/dataset_base.py (2 hunks)
  • EventStream/data/dataset_polars.py (7 hunks)
  • EventStream/data/pytorch_dataset.py (2 hunks)
  • tests/test_e2e_runs.py (5 hunks)
Additional context used
Ruff
tests/test_e2e_runs.py
• 5-5, 6-6, 7-7, 8-8, 9-9, 10-10, 11-11, 12-12, 14-14, 16-16: Module level import not at top of file (E402)
EventStream/data/pytorch_dataset.py
• 209-209: Loop control variable n_events overrides iterable it iterates (B020)
EventStream/data/dataset_base.py
• 1485-1485: Loop control variable m not used within loop body (B007); rename unused m to _m
EventStream/data/dataset_polars.py
• 203-203: Do not compare types, use isinstance() (E721)
• 1674-1674, 1678-1678, 1680-1680, 1683-1683, 1684-1684, 1685-1685, 1692-1692: Function definition does not bind loop variables m, key_prefix, or val_prefix (B023)
• 2039-2039: Use key in dict instead of key in dict.keys() (SIM118); remove .keys()

Additional comments not posted (4)
tests/test_e2e_runs.py (2)

52-108: Review of new method _test_dataset_output.

This method effectively tests the integrity of dataset outputs in Parquet format. It checks that the Deep Learning representations are correctly partitioned across training, tuning, and held-out datasets, and verifies the correct partitioning of subjects among these datasets. The method also ensures that no subjects overlap between these partitions and checks the counts of subjects in each partition.

The detailed comments added in the method enhance the readability and maintainability of the code. However, consider adding a comment explaining the purpose of casting subject IDs to a consistent type in line 79, as this is a subtle yet important step for ensuring data consistency.


135-145: Ensure consistent handling of dataset paths.

The method build_dataset and build_dataset_from_parquet both use the root variable to resolve paths. Ensure that the root variable is consistently set up across different environments to avoid path resolution issues.

Verification successful

Ensure consistent handling of dataset paths.

The root variable is consistently set up using rootutils.setup_root across different environments, as seen in the following files:

  • sample_data/generate_synthetic_data.py
  • tests/test_e2e_runs.py
  • sample_data/examine_synthetic_data.ipynb

This setup ensures that the root variable is correctly configured, addressing the path resolution concerns.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the `root` variable is set up correctly across different environments.

# Test: Check the setup configuration for `root`.
cat rootutils.py | grep 'setup_root'

Length of output: 81



Script:

#!/bin/bash
# Description: Search for the definition and setup of the `root` variable across the codebase.

# Search for the `root` variable definition
rg 'root\s*='

# Search for functions or methods that might setup `root`
rg 'def\s+\w+\(.*root.*\)'

Length of output: 505

EventStream/data/pytorch_dataset.py (1)

474-474: Include subject ID conditionally.

The code correctly includes the subject_id in the output dictionary if configured to do so. This approach is good for flexibility and data privacy, allowing the inclusion of sensitive identifiers only when necessary.
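The pattern being approved, in a reduced form (the flag and field names below are placeholders, not the dataset's real config):

include_subject_id = True  # stand-in for the dataset's config flag
subject_id = 42
out = {"dynamic_indices": [1, 2, 3]}
if include_subject_id:
    out["subject_id"] = subject_id  # the identifier only leaves the dataset when asked for
print(out)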

EventStream/data/dataset_polars.py (1)

714-715: Potential performance improvement in subject event count collection.

Consider using a more direct approach to compute the subjects_with_no_events by utilizing set operations on series directly from polars, which might be more efficient.

-            subjects_df_subjects = self.subjects_df.select(pl.col("subject_id").cast(pl.Utf8))
-            subjects_with_no_events = set(subjects_df_subjects["subject_id"].to_list()) - self.subject_ids
+            subjects_with_no_events = set(self.subjects_df.get_column("subject_id").cast(pl.Utf8).to_list()) - self.subject_ids

@@ -1367,7 +1367,7 @@ def cache_deep_learning_representation(
         NRT_dir = self.config.save_dir / "NRT_reps"

         shards_fp = self.config.save_dir / "DL_shards.json"
-        if shards_fp.exists():
+        if shards_fp.exists() and not do_overwrite:

Consider using json.load() with an open file handle instead of json.loads() with read_text().

json.load(file) is the idiomatic way to parse JSON from a file object and avoids the intermediate read_text() call.

- shards = json.loads(shards_fp.read_text())
+ with open(shards_fp, 'r') as file:
+     shards = json.load(file)

@@ -2,6 +2,7 @@

 root = rootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=True)

+import json

Reorder module-level imports to the top of the file.

To comply with PEP 8, module-level imports should be at the top of the file.

+import json

Committable suggestion was skipped due to low confidence.

Tools
Ruff
• 5-5: Module level import not at top of file (E402)

@@ -461,7 +461,7 @@ def _seeded_getitem(self, idx: int) -> dict[str, list[float]]:

         subject_id, st, end = self.index[idx]

-        shard = self.subj_map[subject_id]
+        shard = self.subj_map[str(subject_id)]

Correct the handling of subject ID.

The subject_id is being accessed as a string, which might not be consistent with other parts of the code where it could be treated as an integer or another type. This inconsistency can lead to errors or unexpected behavior.

-        shard = self.subj_map[str(subject_id)]
+        shard = self.subj_map[subject_id]  # Assuming subject_id is consistently used in its native type across the codebase.

Comment on lines 703 to 709
n_events = self.events_df.group_by("subject_id").agg(pl.len().alias("count"))
# here we cast to str to avoid issues with the subject_id column being various other types as we
# will eventually JSON serialize it.
n_events = n_events.with_columns(pl.col("subject_id").cast(pl.Utf8))
self.n_events_per_subject = {
    subject_id: count for subject_id, count in zip(n_events["subject_id"], n_events["count"])
}

Refactor subject event properties calculation for clarity and efficiency.

The calculation of n_events_per_subject can be streamlined by using direct dictionary construction from n_events dataframe, which avoids the need for zipping and iterating over the series.

-            n_events = n_events.with_columns(pl.col("subject_id").cast(pl.Utf8))
-            self.n_events_per_subject = {
-                subject_id: count for subject_id, count in zip(n_events["subject_id"], n_events["count"])
-            }
+            self.n_events_per_subject = n_events.with_columns(pl.col("subject_id").cast(pl.Utf8)).to_dict(as_series=False)

coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9d703b7 and 7616598.

Files selected for processing (2)
  • EventStream/data/pytorch_dataset.py (2 hunks)
  • tests/data/test_pytorch_dataset.py (1 hunks)
Additional context used
Ruff
EventStream/data/pytorch_dataset.py
• 209-209: Loop control variable n_events overrides iterable it iterates (B020)

Additional comments not posted (2)
tests/data/test_pytorch_dataset.py (1)

323-323: Convert subject IDs to strings in shards dictionary.

The conversion of subject_id to a string ensures consistency with the expected data types in other parts of the application. This change aligns with modifications in other files where subject_id is treated as a string to maintain uniformity across the codebase.

EventStream/data/pytorch_dataset.py (1)

464-473: Ensure consistent handling of subject_id as a string.

The conversion of subject_id to a string before checking in subj_map ensures consistency with the new data type expectations. This change is crucial to avoid type mismatches which can lead to runtime errors. However, ensure that all parts of the application that interact with subject_id are updated to handle it as a string to maintain uniformity.
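A reduced sketch of the string-keyed shard lookup this change standardizes on (shard names and IDs here are invented):

import json

# JSON object keys are always strings, so a shards file round-trips str keys.
shards = json.loads('{"train/0": ["1", "2"], "train/1": ["3"]}')
subj_map = {subj: shard for shard, subjects in shards.items() for subj in subjects}

subject_id = 3  # native integer ID coming from the index
assert subj_map[str(subject_id)] == "train/1"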

coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7616598 and 674b50f.

Files selected for processing (2)
  • EventStream/data/dataset_base.py (1 hunks)
  • tests/test_e2e_runs.py (5 hunks)
Files skipped from review as they are similar to previous changes (2)
  • EventStream/data/dataset_base.py
  • tests/test_e2e_runs.py


codecov bot commented Jun 23, 2024

Codecov Report

Attention: Patch coverage is 76.92308% with 9 lines in your changes missing coverage. Please review.

Project coverage is 86.06%. Comparing base (f84069c) to head (9f3ce52).
Report is 39 commits behind head on dev.

Current head 9f3ce52 differs from pull request most recent head 5150e05

Please upload reports for the commit 5150e05 to get more accurate results.

Files with missing lines:
• EventStream/data/pytorch_dataset.py: patch 37.50%, 5 lines missing ⚠️
• EventStream/baseline/FT_task_baseline.py: patch 60.00%, 2 lines missing ⚠️
• EventStream/data/dataset_polars.py: patch 91.30%, 2 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #117      +/-   ##
==========================================
- Coverage   86.56%   86.06%   -0.50%     
==========================================
  Files          29       29              
  Lines        6556     6387     -169     
==========================================
- Hits         5675     5497     -178     
- Misses        881      890       +9     

☔ View full report in Codecov by Sentry.

coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range and nitpick comments (7)
EventStream/baseline/FT_task_baseline.py (4)

Line range hint 50-81: Refactor the load_flat_rep function to improve readability and maintainability.

The function has complex logic that would benefit from being broken down into smaller, more manageable functions. This would improve readability and maintainability.

+    def _load_params(self, flat_dir):
+        """Helper function to load parameters."""
+        params_fp = flat_dir / "params.json"
+        if not params_fp.is_file():
+            raise FileNotFoundError("Flat representation files haven't been written!")
+        with open(params_fp) as f:
+            return json.load(f)
+
-    params_fp = flat_dir / "params.json"
-    if not params_fp.is_file():
-        raise FileNotFoundError("Flat representation files haven't been written!")
-    with open(params_fp) as f:
-        params = json.load(f)

Line range hint 254-254: Use isinstance() instead of comparing types directly.

Direct type comparison is not recommended. Use isinstance() for checking object types to ensure compatibility with inheritance.

-        if type(window_sizes) is not list:
+        if not isinstance(window_sizes, list):

Line range hint 377-377: Consider defining abstract methods in BaseSklearnModuleConfig.

As BaseSklearnModuleConfig is an abstract base class, it should define at least one abstract method to justify its abstract nature.

+    @abstractmethod
+    def example_method(self):
+        """Example abstract method."""
+        pass

Line range hint 553-553: Adjust conditional expression for clarity.

Replace the Yoda condition with a more conventional expression to improve readability.

-    case float() as frac_subjects if 0 < frac_subjects and frac_subjects < 1:
+    case float() as frac_subjects if frac_subjects > 0 and frac_subjects < 1:
EventStream/data/dataset_polars.py (3)

Line range hint 203-203: Use isinstance() instead of comparing types directly.

Direct type comparison using type() is not recommended as it can lead to errors if subclassing is involved. Use isinstance() for type checks to ensure proper handling of polymorphism.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 1683-1694: Ensure loop variables are correctly bound in lambda functions.

Several lambda functions in the code do not bind the loop variables correctly, leading to potential bugs where the lambda captures the variable from the outer scope instead of the loop iteration. This can be fixed by defaulting the loop variable in the lambda's argument list.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Also applies to: 1701-1701


Line range hint 2048-2048: Optimize dictionary key access.

Use direct dictionary key checks instead of calling .keys() method, which is unnecessary and less efficient.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 674b50f and 71ac9a8.

Files selected for processing (3)
  • EventStream/baseline/FT_task_baseline.py (5 hunks)
  • EventStream/data/dataset_polars.py (7 hunks)
  • scripts/build_flat_reps.py (1 hunks)
Additional context used
Ruff
EventStream/baseline/FT_task_baseline.py
• 254-254: Do not compare types, use isinstance() (E721)
• 377-377: BaseSklearnModuleConfig is an abstract base class, but it has no abstract methods (B024)
• 405-405: No explicit stacklevel keyword argument found (B028)
• 553-553: Yoda conditions are discouraged, use frac_subjects > 0 instead (SIM300)
EventStream/data/dataset_polars.py
• 203-203: Do not compare types, use isinstance() (E721)
• 1683-1683, 1687-1687, 1689-1689, 1692-1692, 1693-1693, 1694-1694, 1701-1701: Function definition does not bind loop variables m, key_prefix, or val_prefix (B023)
• 2048-2048: Use key in dict instead of key in dict.keys() (SIM118); remove .keys()

Additional comments not posted (2)
EventStream/data/dataset_polars.py (2)

693-693: Set subject_id_dtype after checking if events_df is not None.

The assignment of self.subject_id_dtype should be inside the conditional check to ensure events_df is not None before accessing its schema.

-        self.subject_id_dtype = self.events_df.schema["subject_id"]
-        if self.events_df is not None:
+        if self.events_df is not None:
+            self.subject_id_dtype = self.events_df.schema["subject_id"]

740-753: Improve type conversion and error handling in _filter_col_inclusion.

The conversion of inclusion targets to the appropriate data type can be simplified by using polars built-in functions, and the error message can be enhanced for better clarity.

-                        logger.debug(
-                            f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
-                            f"{col} to {df.schema[col]}"
-                        )
-                        if isinstance(list(incl_targets)[0], str):
-                            incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
-                        else:
-                            incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])
-                        incl_list = incl_list.cast(df.schema[col])
-                        logger.debug(
-                            f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
-                        )
+                        try:
+                            incl_list = pl.Series(list(incl_targets)).cast(df.schema[col])
+                        except TypeError as e:
+                            raise TypeError(f"Failed to cast inclusion targets to column '{col}' schema type.") from e

Comment on lines 21 to 45
    cfg = hydra.utils.instantiate(cfg, _convert_="all")
    save_dir = Path(cfg.pop("save_dir"))
    window_sizes = cfg.pop("window_sizes")
    subjects_per_output_file = (
        cfg.pop("subjects_per_output_file") if "subjects_per_output_file" in cfg else None
    )

    # Build flat reps for specified task and window sizes
    ESD = Dataset.load(save_dir)
    feature_inclusion_frequency, include_only_measurements = ESD._resolve_flat_rep_cache_params(
        feature_inclusion_frequency=None, include_only_measurements=None
    )
    cache_kwargs = dict(
        subjects_per_output_file=subjects_per_output_file,
        feature_inclusion_frequency=feature_inclusion_frequency,  # 0.1
        window_sizes=window_sizes,
        include_only_measurements=include_only_measurements,
        do_overwrite=False,
        do_update=True,
    )
    ESD.cache_flat_representation(**cache_kwargs)


if __name__ == "__main__":
    main()

Ensure proper error handling in the main function.

The function uses several pop operations on the configuration dictionary without checking if the keys exist, which could lead to KeyError if they are missing. Consider adding default values or error handling to prevent runtime errors.

-    save_dir = Path(cfg.pop("save_dir"))
-    window_sizes = cfg.pop("window_sizes")
+    save_dir = Path(cfg.get("save_dir", "default_directory"))
+    window_sizes = cfg.get("window_sizes", [])

Comment on lines +19 to +20
@hydra.main(version_base=None, config_path="../configs", config_name="dataset_base")
def main(cfg: DictConfig):

Add documentation for the main function.

The new main function lacks a docstring. Adding a docstring would help other developers understand the purpose and usage of this function.

+    """
+    Main function to build flat representation datasets.
+    Args:
+        cfg (DictConfig): Configuration object containing parameters for dataset processing.
+    """
Committable suggestion


Suggested change

 @hydra.main(version_base=None, config_path="../configs", config_name="dataset_base")
 def main(cfg: DictConfig):
+    """
+    Main function to build flat representation datasets.
+    Args:
+        cfg (DictConfig): Configuration object containing parameters for dataset processing.
+    """

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Outside diff range and nitpick comments (2)
EventStream/data/dataset_polars.py (2)

Line range hint 1686-1697: Bind loop variables m and val_prefix inside the lambda function.

The loop variables m and val_prefix are not bound inside the lambda function, which may lead to unexpected behavior if the function is executed after the loop has finished iterating. Use default arguments to bind the current value of each variable at each iteration.

-                .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
+                .name.map(lambda c, m=m, val_prefix=val_prefix: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
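For reference, the late-binding pitfall is easy to reproduce outside this codebase; the following minimal sketch uses invented measurement names rather than anything from dataset_polars.py:

# Closures capture variables, not values: every lambda sees the final m.
fns = [lambda c: f"dynamic/{m}/{c}" for m in ("age", "hr")]
print([f("count") for f in fns])  # ['dynamic/hr/count', 'dynamic/hr/count']

# A default argument binds the current value at definition time.
fixed = [lambda c, m=m: f"dynamic/{m}/{c}" for m in ("age", "hr")]
print([f("count") for f in fixed])  # ['dynamic/age/count', 'dynamic/hr/count']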

Line range hint 2049-2049: Optimize dictionary key existence check.

Instead of using key in dict.keys(), it's more efficient and Pythonic to use key in dict.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 71ac9a8 and a813468.

Files selected for processing (1)
  • EventStream/data/dataset_polars.py (14 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py

203-203: Do not compare types, use isinstance() (E721)


1686-1686: Function definition does not bind loop variable m (B023)


1686-1686: Function definition does not bind loop variable key_prefix (B023)


1690-1690: Function definition does not bind loop variable m (B023)


1690-1690: Function definition does not bind loop variable val_prefix (B023)


1692-1692: Function definition does not bind loop variable m (B023)


1692-1692: Function definition does not bind loop variable val_prefix (B023)


1695-1695: Function definition does not bind loop variable m (B023)


1695-1695: Function definition does not bind loop variable val_prefix (B023)


1696-1696: Function definition does not bind loop variable m (B023)


1696-1696: Function definition does not bind loop variable val_prefix (B023)


1697-1697: Function definition does not bind loop variable m (B023)


1697-1697: Function definition does not bind loop variable val_prefix (B023)


1704-1704: Function definition does not bind loop variable m (B023)


2049-2049: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

@@ -690,6 +690,7 @@ def _agg_by_time(self):
         )

     def _update_subject_event_properties(self):
+        self.subject_id_dtype = self.events_df.schema["subject_id"]

Ensure events_df is not None before accessing its schema.

The assignment of self.subject_id_dtype should be moved inside the conditional check so that the schema is only accessed when events_df is not None.

-        self.subject_id_dtype = self.events_df.schema["subject_id"]
-        if self.events_df is not None:
+        if self.events_df is not None:
+            self.subject_id_dtype = self.events_df.schema["subject_id"]
Committable suggestion


Suggested change

-        self.subject_id_dtype = self.events_df.schema["subject_id"]
-        if self.events_df is not None:
+        if self.events_df is not None:
+            self.subject_id_dtype = self.events_df.schema["subject_id"]

Comment on lines +740 to +753
logger.debug(
    f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
    f"{col} to {df.schema[col]}"
)
if isinstance(list(incl_targets)[0], str):
    incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
else:
    incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])

incl_list = incl_list.cast(df.schema[col])

logger.debug(
    f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
)

Improve type conversion and error handling in _filter_col_inclusion.

The conversion of inclusion targets to the appropriate data type can be simplified by using polars' built-in cast, and the error message can be made clearer.

-                        logger.debug(
-                            f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
-                            f"{col} to {df.schema[col]}"
-                        )
-                        if isinstance(list(incl_targets)[0], str):
-                            incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
-                        else:
-                            incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])
-                        incl_list = incl_list.cast(df.schema[col])
-                        logger.debug(
-                            f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
-                        )
+                        try:
+                            incl_list = pl.Series(list(incl_targets)).cast(df.schema[col])
+                        except TypeError as e:
+                            raise TypeError(f"Failed to cast inclusion targets to column '{col}' schema type.") from e
Committable suggestion


Suggested change

-                        logger.debug(
-                            f"Converting inclusion targets of type {type(list(incl_targets)[0])} for "
-                            f"{col} to {df.schema[col]}"
-                        )
-                        if isinstance(list(incl_targets)[0], str):
-                            incl_list = pl.Series(list(incl_targets), dtype=pl.Utf8)
-                        else:
-                            incl_list = pl.Series(list(incl_targets), dtype=df.schema[col])
-                        incl_list = incl_list.cast(df.schema[col])
-                        logger.debug(
-                            f"Converted to Series of type {incl_list.dtype} with size {len(incl_list)}"
-                        )
+                        try:
+                            incl_list = pl.Series(list(incl_targets)).cast(df.schema[col])
+                        except TypeError as e:
+                            raise TypeError(f"Failed to cast inclusion targets to column '{col}' schema type.") from e

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (3)
EventStream/data/dataset_polars.py (3)

Line range hint 203-203: Refactor to use isinstance for type checking.

Using isinstance() instead of comparing types directly is more Pythonic and supports subclass checking.

- if type(list(incl_targets)[0]) is str:
+ if isinstance(list(incl_targets)[0], str):
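A short, generic illustration of the difference, unrelated to the codebase: type() comparisons check the exact type, while isinstance() also accepts subclasses.

class MyStr(str):
    pass

s = MyStr("x")
print(type(s) is str)      # False: exact type comparison ignores subclassing
print(isinstance(s, str))  # True: isinstance() respects inheritance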

Line range hint 1683-1694: Function definition does not bind loop variables m and val_prefix.

The loop variables m and val_prefix are not bound in the lambda function, which might lead to unexpected behavior due to late binding.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/count"),
+ .name.map(lambda c, m=m, val_prefix=val_prefix: f"dynamic/{m}/{c.replace(val_prefix, '')}/count"),

Line range hint 2046-2046: Use key in dict instead of key in dict.keys().

It is more efficient and Pythonic to check for key existence directly in the dictionary.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a813468 and 87a3874.

Files selected for processing (1)
  • EventStream/data/dataset_polars.py (14 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py

203-203: Do not compare types, use isinstance() (E721)


1683-1683: Function definition does not bind loop variable m (B023)


1683-1683: Function definition does not bind loop variable key_prefix (B023)


1687-1687: Function definition does not bind loop variable m (B023)


1687-1687: Function definition does not bind loop variable val_prefix (B023)


1689-1689: Function definition does not bind loop variable m (B023)


1689-1689: Function definition does not bind loop variable val_prefix (B023)


1692-1692: Function definition does not bind loop variable m (B023)


1692-1692: Function definition does not bind loop variable val_prefix (B023)


1693-1693: Function definition does not bind loop variable m (B023)


1693-1693: Function definition does not bind loop variable val_prefix (B023)


1694-1694: Function definition does not bind loop variable m (B023)


1694-1694: Function definition does not bind loop variable val_prefix (B023)


1701-1701: Function definition does not bind loop variable m (B023)


2046-2046: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

Additional comments not posted (2)
EventStream/data/dataset_polars.py (2)

693-693: Ensure events_df is not None before accessing its schema.

The assignment of self.subject_id_dtype should be inside the conditional check to ensure events_df is not None before accessing its schema.


740-753: Improve type conversion and error handling in _filter_col_inclusion.

The conversion of inclusion targets to the appropriate data type can be simplified by using polars built-in functions, and the error message can be enhanced for better clarity.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (1)
EventStream/data/dataset_base.py (1)

Line range hint 1488-1488: Rename unused loop variable.

The loop control variable m is not used within the loop body. To avoid confusion, consider renaming it to _m.

- for m, cfg in self.measurement_configs.items():
+ for _m, cfg in self.measurement_configs.items():
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 87a3874 and 5150e05.

Files ignored due to path filters (1)
  • configs/dataset_base.yaml is excluded by !**/*.yaml
Files selected for processing (2)
  • EventStream/data/dataset_base.py (2 hunks)
  • scripts/build_flat_reps.py (1 hunks)
Additional context used
Ruff
EventStream/data/dataset_base.py

1488-1488: Loop control variable m not used within loop body

Rename unused m to _m

(B007)

Additional comments not posted (9)
scripts/build_flat_reps.py (5)

4-17: Imports LGTM!

The imports for stackprinter, hydra, DictConfig, and Dataset are correctly included and necessary for the script's functionality.


19-20: Add documentation for the main function.

The new main function lacks a docstring. Adding a docstring would help other developers understand the purpose and usage of this function.

+    """
+    Main function to build flat representation datasets.
+    Args:
+        cfg (DictConfig): Configuration object containing parameters for dataset processing.
+    """

[EXISTING_COMMENT]


22-23: Ensure proper error handling in the main function.

The function uses several pop operations on the configuration dictionary without checking if the keys exist, which could lead to KeyError if they are missing. Consider adding default values or error handling to prevent runtime errors.

-    save_dir = Path(cfg.pop("save_dir"))
-    window_sizes = cfg.pop("window_sizes")
+    save_dir = Path(cfg.get("save_dir", "default_directory"))
+    window_sizes = cfg.get("window_sizes", [])

[EXISTING_COMMENT]


41-41: Function call LGTM!

The cache_flat_representation function is called with the necessary keyword arguments. Ensure that the function implementation handles these arguments correctly.


44-45: Main function call LGTM!

The main function is correctly called within the if __name__ == "__main__": block.

EventStream/data/dataset_base.py (4)

1242-1247: Changes LGTM!

The changes to handle parameter files more robustly in the cache_flat_representation function are correctly implemented. Ensure that the function handles potential errors and includes necessary documentation.


1373-1390: Optimize JSON file handling.

Consider using json.load() instead of json.loads() with read_text(): json.load() reads directly from the open file handle, avoiding the intermediate string that read_text() materializes in memory.

- shards = json.loads(shards_fp.read_text())
+ with open(shards_fp, 'r') as file:
+     shards = json.load(file)

[EXISTING_COMMENT]
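A quick sketch of the pattern for illustration (the path and file contents here are placeholders, not the real shards file):

import json
from pathlib import Path

shards_fp = Path("shards.json")  # placeholder path
shards_fp.write_text('{"train/0": [1, 2, 3]}')

# json.load() reads from the open file handle directly, avoiding the
# intermediate string that read_text() would build in memory.
with open(shards_fp) as file:
    shards = json.load(file)
print(shards)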


Line range hint 114-116: Changes LGTM!

The assignment for self.subject_id_dtype in the _update_subject_event_properties function is correctly implemented.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5150e05 and 532c3dd.

Files selected for processing (1)
  • EventStream/data/dataset_base.py (5 hunks)
Files skipped from review as they are similar to previous changes (1)
  • EventStream/data/dataset_base.py
