
Adding memory efficient updates #17

Closed
wants to merge 4 commits into main


Conversation

mmcdermott (Owner) commented Jun 13, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced data type specification in aggregation functions for improved calculation precision.
    • Introduced new functions for efficient sparse matrix assignments and precomputation.
  • Refactor

    • Split aggregation logic into separate precomputation and aggregation functions for better modularity.
    • Improved data transformation and processing steps for time series data to enhance performance and accuracy.
  • Bug Fixes

    • Adjusted dynamic measurement summarization to ensure correct shape calculation for sparse matrix generation.

Oufattole and others added 3 commits June 12, 2024 06:42
coderabbitai bot (Contributor) commented Jun 13, 2024

Walkthrough

The recent updates enhance the functionality and efficiency of the generate_summarized_reps.py and generate_ts_features.py modules. Key changes include the introduction of new parameters and functions to handle data types and matrix assignments, restructuring of functions for better aggregation and data transformation, and improved handling of time series data. These adjustments aim to optimize the processing of sparse matrices and dynamic measurements, ensuring more accurate and efficient data handling.

Changes

  • src/MEDS_tabular_automl/generate_summarized_reps.py: Added a dtype parameter to sparse_aggregate, introduced a sparse_assign function, and split aggregate_matrix into precompute_matrix and aggregate_matrix.
  • src/MEDS_tabular_automl/generate_ts_features.py: Refactored get_long_code_df for more efficient data transformation, and updated summarize_dynamic_measurements to calculate the shape for sparse matrix generation.

Poem

In data's dance, a change we see,
With dtype guiding, more precise we'll be.
Sparse matrices now align with care,
Aggregations swift, transformations fair.
Time series flow in harmony's might,
Our code now shines, a beacon bright.
🌟🐇✨



codecov-commenter commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 6.66667% with 56 lines in your changes missing coverage. Please review.

Project coverage is 37.80%. Comparing base (530ebff) to head (4d69bb6).
Report is 264 commits behind head on main.

Files with missing lines                              Patch %   Missing
src/MEDS_tabular_automl/generate_summarized_reps.py     8.16%   45 ⚠️
src/MEDS_tabular_automl/generate_ts_features.py         0.00%   11 ⚠️
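(In other words, 56 of the 60 changed lines are untested: 4 of 49 changed lines are hit in generate_summarized_reps.py (8.16%) and 0 of 11 in generate_ts_features.py, giving 4/60 ≈ 6.67% patch coverage overall.)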
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
- Coverage   38.64%   37.80%   -0.84%     
==========================================
  Files          14       14              
  Lines         942      968      +26     
==========================================
+ Hits          364      366       +2     
- Misses        578      602      +24     

☔ View full report in Codecov by Sentry.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed between the base of the PR (530ebff) and head (4d69bb6).

Files selected for processing (2)
  • src/MEDS_tabular_automl/generate_summarized_reps.py (3 hunks)
  • src/MEDS_tabular_automl/generate_ts_features.py (2 hunks)
Additional context used

Ruff: src/MEDS_tabular_automl/generate_summarized_reps.py
  • 6-6: Module level import not at top of file (E402)
  • 7-7: Module level import not at top of file (E402)
  • 9-9: Module level import not at top of file (E402)
  • 10-10: Module level import not at top of file (E402)
  • 11-16: Module level import not at top of file (E402)
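
For context (an illustrative snippet, not the repository's code): E402 fires whenever a module-level import follows any executable statement, a pattern sometimes used deliberately to configure a library before importing it:

import os

# any executable statement before the next import triggers E402 below
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl  # noqa: E402  (intentional: the env var must be set before import)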

Additional comments not posted (2)
src/MEDS_tabular_automl/generate_ts_features.py (1)

95-104: Ensure the sparse matrix generation handles different aggregation types correctly.

Verification successful

The function summarize_dynamic_measurements handles different aggregation types (CODE_AGGREGATIONS and VALUE_AGGREGATIONS) and generates a sparse matrix accordingly. The implementation appears to correctly differentiate between these aggregation types and processes them using appropriate helper functions (get_long_code_df and get_long_value_df).

  • The function checks if agg is in CODE_AGGREGATIONS or VALUE_AGGREGATIONS.
  • For CODE_AGGREGATIONS, it drops the numerical_value column and processes the data using get_long_code_df.
  • For VALUE_AGGREGATIONS, it drops the ID columns and processes the data using get_long_value_df.
  • The sparse matrix is then created using the processed data and the appropriate shape.

The sparse matrix generation seems to handle different aggregation types correctly based on the provided implementation.
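
A condensed sketch of that dispatch, reconstructed from the review notes above rather than taken from the PR (the aggregation-set members and ID_COLS are placeholders; get_long_code_df and get_long_value_df are the repo helpers discussed in this review):

from scipy.sparse import coo_array

CODE_AGGREGATIONS = {"code/count"}      # placeholder members, for illustration
VALUE_AGGREGATIONS = {"value/sum"}      # placeholder members, for illustration
ID_COLS = ["patient_id", "timestamp"]   # hypothetical name for the ID columns

def summarize_dynamic_measurements(agg, ts_columns, df):
    """Dispatch on aggregation type, then build the sparse matrix."""
    if agg in CODE_AGGREGATIONS:
        # code aggregations need only code presence/counts, so drop the values
        data, (rows, cols), shape = get_long_code_df(df.drop("numerical_value"), ts_columns)
    elif agg in VALUE_AGGREGATIONS:
        # value aggregations operate on the numerical values themselves
        data, (rows, cols), shape = get_long_value_df(df.drop(ID_COLS), ts_columns)
    else:
        raise ValueError(f"Unsupported aggregation: {agg}")
    return coo_array((data, (rows, cols)), shape=shape)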

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that all aggregation types are correctly handled in the sparse matrix generation.

# Test: Search for the function usage with different aggregation types. Expect: Correct handling for each type.
rg --type python $'summarize_dynamic_measurements'

Length of output: 80



Script:

#!/bin/bash
# Description: Verify that all aggregation types are correctly handled in the sparse matrix generation.

# Test: Search for the function definition and usage of `summarize_dynamic_measurements`.
rg 'def summarize_dynamic_measurements' -A 10
rg 'summarize_dynamic_measurements' -A 5

Length of output: 1770



Script:

#!/bin/bash
# Description: Extract the full implementation of `summarize_dynamic_measurements` and its usage context.

# Extract the full implementation of `summarize_dynamic_measurements`.
rg 'def summarize_dynamic_measurements' -A 50

# Extract the surrounding context of its usage.
rg 'return summarize_dynamic_measurements' -A 10 -B 10

Length of output: 5484

src/MEDS_tabular_automl/generate_summarized_reps.py (1)

19-27: Ensure the sparse_aggregate function correctly handles different aggregation methods.

Verification successful

The sparse_aggregate function correctly handles different aggregation methods, including "sum", "min", "max", "sum_sqd", and "count". The agg variable covers all these methods, ensuring the function's correctness.

  • agg is assigned values that include "sum", "min", "max", "sum_sqd", and "count".
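
For orientation, a minimal sketch of what an aggregator covering these five methods with a dtype parameter could look like (a hedged reconstruction, not the PR's implementation):

import numpy as np
from scipy.sparse import csr_array

def sparse_aggregate(sparse_matrix: csr_array, agg: str, dtype=np.float32):
    """Aggregate a sparse window over its rows in the requested dtype."""
    if agg == "sum":
        return np.asarray(sparse_matrix.sum(axis=0, dtype=dtype)).ravel()
    if agg == "sum_sqd":
        squared = sparse_matrix.multiply(sparse_matrix)
        return np.asarray(squared.sum(axis=0, dtype=dtype)).ravel()
    if agg == "count":
        # number of stored (non-zero) entries per column
        return sparse_matrix.getnnz(axis=0).astype(dtype)
    if agg == "min":
        return sparse_matrix.min(axis=0)  # scipy returns a sparse (coo) result here
    if agg == "max":
        return sparse_matrix.max(axis=0)
    raise ValueError(f"Unknown aggregation: {agg}")

Note that in scipy, min/max on a sparse matrix return a sparse (coo) result while sum and count produce dense arrays, which lines up with the two branches sparse_assign handles below.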
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that all aggregation methods are correctly implemented in the `sparse_aggregate` function.

# Test: Search for the function usage with different aggregation methods. Expect: Correct implementation for each method.
rg --type python $'sparse_aggregate'

Length of output: 66



Script:

#!/bin/bash
# Description: Verify that all aggregation methods are correctly implemented in the `sparse_aggregate` function.

# Test: Search for the function usage with different aggregation methods. Expect: Correct implementation for each method.
rg --type py 'sparse_aggregate'

Length of output: 378



Script:

#!/bin/bash
# Description: Extract and review the lines of code where `sparse_aggregate` is called to verify the usage of different aggregation methods.

# Extract lines where `sparse_aggregate` is called.
rg --type py 'sparse_aggregate' -A 3 -B 3

Length of output: 1937



Script:

#!/bin/bash
# Description: Extract specific lines where `sparse_aggregate` is called to verify the usage of different aggregation methods.

# Extract specific lines where `sparse_aggregate` is called.
rg --type py 'sparse_aggregate' -n

Length of output: 390



Script:

#!/bin/bash
# Description: Extract lines around 36 and 79 in `generate_summarized_reps.py` to verify the aggregation methods used in `sparse_aggregate` calls.

# Extract lines around 36 and 79 in `generate_summarized_reps.py`.
sed -n '33,39p;76,82p' src/MEDS_tabular_automl/generate_summarized_reps.py

Length of output: 714



Script:

#!/bin/bash
# Description: Search for the definitions or assignments of `value` and `agg` in the `generate_summarized_reps.py` file to verify the aggregation methods used in `sparse_aggregate` calls.

# Search for the definitions or assignments of `value`.
rg --type py 'value =' src/MEDS_tabular_automl/generate_summarized_reps.py

# Search for the definitions or assignments of `agg`.
rg --type py 'agg =' src/MEDS_tabular_automl/generate_summarized_reps.py

Length of output: 351

Comment on lines +28 to +49
x = df.with_columns(
    pl.col("code").cast(str).replace(column_to_int).cast(int).alias("code_index"),
    pl.lit(1).alias("count"),
).drop("code")
# sum up counts for same patient_id, timestamp, code_index
x = x.group_by("patient_id", "timestamp", "code_index").sum()
# combine codes and counts for same patient_id, timestamp
x = x.group_by("patient_id", "timestamp", maintain_order=True).agg(pl.col("code_index", "count"))

# repeat row_index for each code_index on that row (i.e. 1 row == 1 unique patient_id x timestamp)
rows = (
    x.with_row_index("row_index")
    .select(pl.col("row_index").repeat_by(pl.col("code_index").list.len()))
    .select(pl.col("row_index").explode())
    .collect()
    .to_numpy()
    .T
)[0]
cols = x.select(pl.col("code_index").explode()).collect().to_numpy().T[0]
data = x.select(pl.col("count").explode()).collect().to_numpy().T[0]
shape = (x.select(pl.len()).collect().item(), len(ts_columns))
return data, (rows, cols), shape
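
For reference, the triplet returned here feeds directly into scipy's COO constructor; a minimal usage sketch (the argument order of get_long_code_df is guessed from the names used in its body):

from scipy.sparse import coo_array

data, (rows, cols), shape = get_long_code_df(df, ts_columns, column_to_int)
code_matrix = coo_array((data, (rows, cols)), shape=shape)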

Refactor the get_long_code_df function to improve clarity and efficiency.

-    x = df.with_columns(
-        pl.col("code").cast(str).replace(column_to_int).cast(int).alias("code_index"),
-        pl.lit(1).alias("count"),
-    ).drop("code")
-    x = x.group_by("patient_id", "timestamp", "code_index").sum()
-    x = x.group_by("patient_id", "timestamp", maintain_order=True).agg(pl.col("code_index", "count"))
-    rows = (
-        x.with_row_index("row_index")
-        .select(pl.col("row_index").repeat_by(pl.col("code_index").list.len()))
-        .select(pl.col("row_index").explode())
-        .collect()
-        .to_numpy()
-        .T
-    )[0]
-    cols = x.select(pl.col("code_index").explode()).collect().to_numpy().T[0]
-    data = x.select(pl.col("count").explode()).collect().to_numpy().T[0]
-    shape = (x.select(pl.len()).collect().item(), len(ts_columns))
+    # Simplify the transformation and aggregation steps
+    x = df.with_columns(
+        pl.col("code").cast(str).replace(column_to_int).cast(int).alias("code_index"),
+        pl.lit(1).alias("count")
+    ).drop("code")
+    .group_by("patient_id", "timestamp", "code_index").agg(pl.sum("count").alias("count"))
+    .group_by("patient_id", "timestamp", maintain_order=True).agg(pl.list("code_index").alias("code_indices"), pl.list("count").alias("counts"))
+
+    # Flatten the lists to get the final rows, cols, and data arrays
+    rows = x.select(pl.explode(pl.col("code_indices").repeat_by(pl.col("counts")))).to_numpy().flatten()
+    cols = x.select(pl.explode(pl.col("code_indices"))).to_numpy().flatten()
+    data = x.select(pl.explode(pl.col("counts"))).to_numpy().flatten()
+    shape = (x.select(pl.len()).collect().item(), len(ts_columns))

This refactoring simplifies the data transformation and aggregation steps, making the code easier to understand and potentially more efficient by reducing the number of operations and intermediate data structures.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the suggested replacement code, reconstructed from the diff above; the lines it replaces are the original get_long_code_df body quoted earlier):

# Simplify the transformation and aggregation steps
x = df.with_columns(
    pl.col("code").cast(str).replace(column_to_int).cast(int).alias("code_index"),
    pl.lit(1).alias("count")
).drop("code")
.group_by("patient_id", "timestamp", "code_index").agg(pl.sum("count").alias("count"))
.group_by("patient_id", "timestamp", maintain_order=True).agg(pl.list("code_index").alias("code_indices"), pl.list("count").alias("counts"))

# Flatten the lists to get the final rows, cols, and data arrays
rows = x.select(pl.explode(pl.col("code_indices").repeat_by(pl.col("counts")))).to_numpy().flatten()
cols = x.select(pl.explode(pl.col("code_indices"))).to_numpy().flatten()
data = x.select(pl.explode(pl.col("counts"))).to_numpy().flatten()
shape = (x.select(pl.len()).collect().item(), len(ts_columns))
return data, (rows, cols), shape

Comment on lines +35 to +50
def sparse_assign(row_index, index, data, row, col, sparse_matrix, value, dtype):
    agg_matrix = sparse_aggregate(sparse_matrix, value, dtype)
    if isinstance(agg_matrix, np.ndarray):
        nozero_ind = agg_matrix.nonzero()[0]
        len_data = len(nozero_ind)
        col[index : index + len_data] = nozero_ind
        data[index : index + len_data] = agg_matrix[nozero_ind]
        row[index : index + len_data] = row_index
    elif isinstance(agg_matrix, coo_array):
        len_data = len(agg_matrix.data)
        col[index : index + len_data] = agg_matrix.col
        data[index : index + len_data] = agg_matrix.data
        row[index : index + len_data] = row_index
    else:
        raise TypeError(f"Invalid matrix type {type(agg_matrix)}")
    return len_data
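
For orientation, a hedged sketch of how a caller might drive sparse_assign over preallocated COO buffers (illustrative only: it assumes the PR's sparse_assign and sparse_aggregate are importable, and the random windows stand in for the repo's real per-row slices):

import numpy as np
from scipy.sparse import coo_array
from scipy.sparse import random as sparse_random

n_rows, n_cols = 100, 500
capacity = n_rows * n_cols  # worst case; a real caller would precompute a tighter bound

data = np.zeros(capacity, dtype=np.float32)
row = np.zeros(capacity, dtype=np.int64)
col = np.zeros(capacity, dtype=np.int64)

# dummy per-row windows; the repo computes these from the tabularized data
windows = [sparse_random(4, n_cols, density=0.01, format="csr") for _ in range(n_rows)]

index = 0
for row_index, window in enumerate(windows):
    # sparse_assign writes this row's non-zeros into the flat buffers and
    # returns how many entries it wrote, so the offset advances in place
    index += sparse_assign(row_index, index, data, row, col, window, "sum", np.float32)

result = coo_array((data[:index], (row[:index], col[:index])), shape=(n_rows, n_cols))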

Refactor the sparse_assign function to improve clarity and efficiency.

-    agg_matrix = sparse_aggregate(sparse_matrix, value, dtype)
-    if isinstance(agg_matrix, np.ndarray):
-        nozero_ind = agg_matrix.nonzero()[0]
-        len_data = len(nozero_ind)
-        col[index : index + len_data] = nozero_ind
-        data[index : index + len_data] = agg_matrix[nozero_ind]
-        row[index : index + len_data] = row_index
-    elif isinstance(agg_matrix, coo_array):
-        len_data = len(agg_matrix.data)
-        col[index : index + len_data] = agg_matrix.col
-        data[index : index + len_data] = agg_matrix.data
-        row[index : index + len_data] = row_index
-    else:
-        raise TypeError(f"Invalid matrix type {type(agg_matrix)}")
+    # Simplify the handling of different matrix types
+    agg_matrix = sparse_aggregate(sparse_matrix, value, dtype)
+    len_data = len(agg_matrix.data)
+    col[index : index + len_data] = agg_matrix.col
+    data[index : index + len_data] = agg_matrix.data
+    row[index : index + len_data] = row_index

This refactoring simplifies the handling of different matrix types, making the code easier to understand and potentially more efficient by reducing the number of conditional branches.

Committable suggestion was skipped due to low confidence.

mmcdermott (Owner, Author) commented:

@Oufattole is this going to be merged or should we close it unmerged? I'd rather not have it hanging around indefinitely.

mmcdermott (Owner, Author) commented:

Closing as this has gone a long time without update from the original code author.

mmcdermott closed this on Aug 10, 2024.