Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some small changes to update things for more recent versions of packages. #115

Merged
merged 2 commits into from
Jun 22, 2024

Conversation

mmcdermott
Copy link
Owner

No description provided.

Copy link

coderabbitai bot commented Jun 22, 2024

Walkthrough

The updates across several files aim to replace usage of np.NaN with float('nan') to handle missing or filtered values uniformly. This change improves consistency in handling outliers and null values across different functions, datasets, and tests within the data library, while also refining boundary conditions and value transformation processes.

Changes

Files Summary
EventStream/data/README.md Updated representation of missing or filtered values from np.NaN to float('nan').
EventStream/data/dataset_polars.py Refactored drop_or_censor, _transform_numerical_measurement, and _transform_categorical_measurement to use float('nan') instead of np.NaN, and adjusted boundary condition handling.
EventStream/data/pytorch_dataset.py Modified to_int_index function to handle null values and sorting differently, influencing added columns and return values.
tests/data/test_pytorch_dataset.py Updated the dynamic_values lists to use float("nan") instead of np.NaN for null value representation.

Sequence Diagram(s)

Sequence diagrams are not applicable for these changes as the updates mainly involve data representation and handling refinements, without significant alterations to control flow or feature interactions.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Outside diff range and nitpick comments (4)
EventStream/data/pytorch_dataset.py (1)

Line range hint 209-209: Potential override of loop variable n_events

The static analysis has flagged a potential issue where the loop control variable n_events may override the iterable it iterates over. This could lead to unexpected behavior or bugs.

- for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):
+ for i, (subj, num_events) in enumerate(zip(subject_ids, n_events)):
EventStream/data/dataset_polars.py (3)

Line range hint 203-203: Refactor type checking to use isinstance().

The static analysis tool suggests using isinstance() for type checking instead of direct type comparison. This is a more Pythonic way of checking types and can handle cases where a variable might be an instance of a subclass.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 1648-1666: Ensure loop variables are correctly bound in lambda functions.

Several warnings have been generated about lambda functions not binding loop variables correctly within list comprehensions. This can lead to unexpected behavior where the lambda captures the same (last) value for each iteration instead of the intended loop-specific value.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .name.map(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Line range hint 2013-2013: Optimize dictionary key existence check.

The static analysis tool suggests using a more efficient method for checking key existence in a dictionary. Instead of using key in dict.keys(), which unnecessarily creates a list of keys, use key in dict.

- if feature not in cfg.vocabulary.idxmap.keys():
+ if feature not in cfg.vocabulary.idxmap:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e77b5c9 and 5b4a279.

Files ignored due to path filters (1)
  • pyproject.toml is excluded by !**/*.toml
Files selected for processing (4)
  • EventStream/data/README.md (2 hunks)
  • EventStream/data/dataset_polars.py (6 hunks)
  • EventStream/data/pytorch_dataset.py (2 hunks)
  • tests/data/test_pytorch_dataset.py (2 hunks)
Additional context used
Ruff
EventStream/data/pytorch_dataset.py

209-209: Loop control variable n_events overrides iterable it iterates (B020)

EventStream/data/dataset_polars.py

203-203: Do not compare types, use isinstance() (E721)


1648-1648: Function definition does not bind loop variable m (B023)


1648-1648: Function definition does not bind loop variable key_prefix (B023)


1652-1652: Function definition does not bind loop variable m (B023)


1652-1652: Function definition does not bind loop variable val_prefix (B023)


1654-1654: Function definition does not bind loop variable m (B023)


1654-1654: Function definition does not bind loop variable val_prefix (B023)


1657-1657: Function definition does not bind loop variable m (B023)


1657-1657: Function definition does not bind loop variable val_prefix (B023)


1658-1658: Function definition does not bind loop variable m (B023)


1658-1658: Function definition does not bind loop variable val_prefix (B023)


1659-1659: Function definition does not bind loop variable m (B023)


1659-1659: Function definition does not bind loop variable val_prefix (B023)


1666-1666: Function definition does not bind loop variable m (B023)


2013-2013: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

LanguageTool
EventStream/data/README.md

[style] ~7-~7: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...ep-learning applications and models. 3. For general modeling and data analysis. Th...


[style] ~55-~55: Although a hyphen is possible, it is not necessary in a compound modifier in which the first word is an adverb that ends in ‘ly’. (HYPHENATED_LY_ADVERB_ADJECTIVE)
Context: ...ta are presented to models in a sparse, fully-temporally realized format which should be more ef...


[style] ~58-~58: ‘with respect to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_WITH_RESPECT_TO)
Context: ...yle of data embedding overall (at least with respect to the per-timestamp data embedding practi...


[uncategorized] ~63-~63: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...nly the following data: 1. subject_id, which is the identifier for the subject...


[uncategorized] ~64-~64: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ntifier for the subject. 2. start_time, which is the start timestamp for the su...


[uncategorized] ~64-~64: It seems likely that a singular genitive (’s) apostrophe is missing. (AI_HYDRA_LEO_APOSTROPHE_S_XS)
Context: ..., which is the start timestamp for the subjects overall record. 3. static_indices`, wh...


[uncategorized] ~65-~65: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ects overall record. 3. static_indices, which contains a list of integral indic...


[uncategorized] ~68-~68: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...sistent. 4. static_measurement_indices, which contains a list of the integral i...


[uncategorized] ~71-~71: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... length with static_indices. 5. time, which contains a list of the time in mi...


[uncategorized] ~72-~72: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... event in question. 6. dynamic_indices, which contains a list of lists, where t...


[uncategorized] ~75-~75: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... event. 7. dynamic_measurement_indices, which is of the same (ragged) shape as ...


[uncategorized] ~77-~77: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...n dynamic_indices. 8. dynamic_values, which is of the same (ragged) shape as ...


[uncategorized] ~93-~93: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...g to how they vary in time: 1. STATIC: in which case they are unchanging and c...


[uncategorized] ~94-~94: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... subject. 2. FUNCTIONAL_TIME_DEPENDENT: in which case they can be specified in ...


[uncategorized] ~96-~96: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: .../or a continuous timestamp. 3. DYNAMIC: in which case they are time-varying, bu...


[grammar] ~98-~98: The adjective “many-to-one” is spelled with hyphens. (ONE_TO_MANY_HYPHEN)
Context: ...asurements are linked to events in a many to one fashion and are identified via a separa...


[style] ~107-~107: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing. (REP_NEED_TO_VB)
Context: ...or deep-learning applications (e.g., we need to use embedding matrices for categoric...


[grammar] ~109-~109: It looks like the word order is not correct. (WORD_ORDER_QUESTIONS)
Context: ...us values for numerical values). 3. How one would go about generating such data within a ...


[uncategorized] ~115-~115: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ently: 1. SINGLE_LABEL_CLASSIFICATION: In this case, the measure is assumed to...


[uncategorized] ~117-~117: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...maximum. 2. MULTI_LABEL_CLASSIFICATION: In this case, the measure is assumed to...


[uncategorized] ~120-~120: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...asurements. 3. MULTIVARIATE_REGRESSION: In this case, the measure is assumed to...


[grammar] ~122-~122: This phrase is duplicated. You should probably use “the values” only once. (PHRASE_REPETITION)
Context: ... multivariate regression problem and the values the values observed. This keys for this measure ar...


[uncategorized] ~125-~125: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...r that event. 4. UNIVARIATE_REGRESSION: In this case, the measure is assumed to...


[uncategorized] ~137-~137: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...rocessing of those values: 1. INTEGER: the observed values will be converted t...


[uncategorized] ~138-~138: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ill be converted to integers. 2. FLOAT: the observed values will remain as floa...


[uncategorized] ~138-~138: If this is a compound adjective that modifies the following noun, use a hyphen. (EN_COMPOUND_ADJECTIVE_INTERNAL)
Context: ...AT: the observed values will remain as floating point values. 3. CATEGORICAL_INTEGER`: the o...


[uncategorized] ~139-~139: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...g point values. 3. CATEGORICAL_INTEGER: the observed values will be converted t...


[style] ~140-~140: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ... normalized into a set of categories on the basis of the values observed. 4. `CATEGORICAL_FL...


[uncategorized] ~141-~141: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ... values observed. 4. CATEGORICAL_FLOAT: the observed values will be normalized ...


[style] ~141-~141: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ... be normalized into a set of categories on the basis of the values observed. #### Configura...


[uncategorized] ~147-~147: Do not mix variants of the same word (‘modelling’ and ‘modeling’) within a single text. (EN_EXACT_COHERENCY_RULE)
Context: ....Dataset` object should pre-process for modelling use, but it also is filled during pre-p...


[uncategorized] ~152-~152: Loose punctuation mark. (UNLIKELY_OPENING_PUNCTUATION)
Context: ...ds include: 1. MeasurementConfig.name: contains the name of the measurement, a...


[style] ~153-~153: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ...manually in practice; it will be filled on the basis of how the measurement config is stored...


[grammar] ~164-~164: The expression “vice versa” is spelled without hyphens. (VICE_VERSA)
Context: ...able to a plain-old-data dictionary and vice-versa) that is used to compute the column val...


[style] ~183-~183: The words ‘observed’ and ‘observation’ are quite similar. Consider replacing ‘observed’ with a different word. (VERB_NOUN_SENT_LEVEL_REP)
Context: ...res how many times that measurement was observed with a non-null value (or a non-null...


[style] ~185-~185: The words ‘observed’ and ‘observation’ are quite similar. Consider replacing ‘observed’ with a different word. (VERB_NOUN_SENT_LEVEL_REP)
Context: ...) per possible instance where it was observed at all (e.g., all possible subjects wit...


[style] ~201-~201: The phrase “A variety of” may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF)
Context: ...qual to the keys in this dictionary. 2. A variety of preprocessing simple parameters, which ...


[style] ~227-~227: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ... passed as a parameter to the function. For saving (loading) the dataset will write...


[uncategorized] ~230-~230: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short). (COMMA_COMPOUND_SENTENCE)
Context: ... we recommend using the parquet format and it is the only format supported out of ...


[uncategorized] ~268-~268: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...d normalizer model is a standard scaler model which centers data to have zero mean an...


[style] ~280-~280: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ...events in the data, they are determined on the basis of a priori, known functions whose sole in...


[uncategorized] ~301-~301: Do not mix variants of the same word (‘modelling’ and ‘modeling’) within a single text. (EN_EXACT_COHERENCY_RULE)
Context: ...uring dataset pre-processing for use in modelling. After transformation, column types wil...


[grammar] ~320-~320: Two determiners in a row. Choose either “a” or “the”. (DT_DT)
Context: ..._df.index` and indicates to which event a the row's metadata element corresponds. Man...


[uncategorized] ~331-~331: Do not mix variants of the same word (‘modelling’ and ‘modeling’) within a single text. (EN_EXACT_COHERENCY_RULE)
Context: ...uring dataset pre-processing for use in modelling. ### EventStream.data.PytorchDataset...


[uncategorized] ~349-~349: Do not mix variants of the same word (‘modelling’ and ‘modeling’) within a single text. (EN_EXACT_COHERENCY_RULE)
Context: ...s suitable for embedding and downstream modelling over randomly chosen sub-sequences with...


[uncategorized] ~350-~350: Do not mix variants of the same word (‘modelling’ and ‘modeling’) within a single text. (EN_EXACT_COHERENCY_RULE)
Context: ... mode is useful for generative sequence modelling, but less so for supervised learning, i...


[uncategorized] ~354-~354: The preposition ‘for’ seems more likely in this position. (AI_HYDRA_LEO_REPLACE_AT_FOR)
Context: ...ers can also specify a task_df object at construction of this dataset. This data...


[style] ~370-~370: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ... the sequence length for patient i. - Let K denote the number of static data el...


[style] ~371-~371: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...ta elements observed for patient i. - Let M[j] denote the number of per-event d...


[uncategorized] ~400-~400: When ‘event-specific’ is used as a modifier, it is usually spelled with a hyphen. (SPECIFIC_HYPHEN)
Context: ...equence. - 'dynamic_*' corresponds to event specific metadata elements describing each seque...


[style] ~401-~401: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...ng each sequence event. - '*_indices' corresponds to the categorical index of the data el...


[style] ~409-~409: You’ve already used the word ‘also’ once in your sentence, so using it again may be redundant. (REDUNDANT_FILLER)
Context: ...ls were also specified, then there will also be an entry in the output dictionary per t...


[misspelling] ~415-~415: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’. (EN_A_VS_AN)
Context: ...essed like a dictionary, but it is also a object stored in types.py of class `E...


[style] ~423-~423: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...er-batch maximum sequence length. - M is the per-batch maximum number of data el...


[style] ~424-~424: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...umber of data elements per event. - K is the per-batch maximum number of static ...


[style] ~448-~448: You’ve already used the word ‘also’ once in your sentence, so using it again may be redundant. (REDUNDANT_FILLER)
Context: ...ls were also specified, then there will also be a dictionary at key stream_labels wit...


[style] ~462-~462: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ...ng matrices of differing dimensions, prior to being combined via projection layers an...

Markdownlint
EventStream/data/README.md

373-373: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified


426-426: null (MD040, fenced-code-language)
Fenced code blocks should have a language specified

Additional comments not posted (7)
tests/data/test_pytorch_dataset.py (2)

182-182: Updated representation of NaN values in test data

The change from np.NaN to float('nan') in dynamic_values is consistent with the updates across the project to standardize NaN representations. This will ensure that tests remain aligned with the implementation.


244-244: Updated representation of NaN values in test data

The change from np.NaN to float('nan') in dynamic_values is consistent with the updates across the project to standardize NaN representations. This will ensure that tests remain aligned with the implementation.

EventStream/data/pytorch_dataset.py (1)

57-57: Updated handling of null values and sorting in to_int_index

The new implementation of to_int_index correctly handles null values by explicitly returning None for them and uses a more efficient sorting mechanism. This change should improve the robustness and performance of index generation.

EventStream/data/README.md (1)

79-80: Updated Representation of Missing or Outlier Values

The documentation now consistently uses float('nan') instead of np.NaN to represent missing or filtered outlier values. This change aligns with the updates across the project's codebase to standardize NaN representations.

EventStream/data/dataset_polars.py (3)

434-455: Updated handling of bounds to use float('nan').

The function drop_or_censor has been updated to use float('nan') instead of np.NaN for handling boundary conditions. This change is consistent with the PR's objective to update the codebase for more recent versions of packages. The documentation and parameters have been updated accordingly to reflect this change. Ensure that all references to this function across the project are compatible with these changes.


1181-1181: Updated handling of numerical values to use float('nan').

The function _transform_numerical_measurement now uses float('nan') in place of np.NaN when handling numerical measurements that do not meet certain criteria. This update aligns with the overall changes in the project to standardize the representation of NaN values.


1243-1243: Updated handling of categorical values to use float('nan').

The function _transform_categorical_measurement has been updated to use float('nan') for handling categorical measurements that are not found in the vocabulary. This change ensures consistency with the new standard for NaN values across the project.

@mmcdermott mmcdermott merged commit ce9e2c3 into dev Jun 22, 2024
2 checks passed
@mmcdermott mmcdermott deleted the update_packages branch June 22, 2024 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant