Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Extract RegexDFAState class, RegexDFAStatePair class, and RegexDFAStateType enum into their own files. #57

Merged
merged 372 commits into from
Dec 11, 2024

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Dec 5, 2024

  • Doing a git diff will provide better comparisons for files with changed names than using the Github UI:
git fetch upstream
git fetch upstream pull/57/head:pr-57
git diff upstream/main pr-57 

Description

Separate DFA Functionality into different files.

  • RegexDFAState class moved into its own file.
  • RegexDFAStatePair class moved into its own file.
  • RegexDFAStateType enum moved into its own file.

Validation performed

Previously existing tests still succeed.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new header files for managing DFA states and their types.
    • Added functionality for creating and managing DFA states, including methods for transitions and acceptance checks.
    • Implemented logic for computing intersections of DFAs.
  • Bug Fixes

    • Streamlined the DFA state management by consolidating redundant structures.
  • Documentation

    • Updated documentation to reflect new classes and methods for better clarity on DFA state management.

SharafMohamed and others added 30 commits October 24, 2024 11:48
…ransitions) return nullopt if state_ids is malformed.
…ake it clear to the reader that both failures are handled the same way and return nullopt. For more complicated return cases it would warrant the reader looking at the doc for the individual functions, but here I think we can make their life easier.
…nor are parts of the rules stored, instead the rules are only read and used to build the NFA.
…call succeeds in NFA's serialize.

Co-authored-by: Lin Zhihao <[email protected]>
…on classes when they are initialized in their constructor.
…egativeTaggedTransition classes into their own files.
…just an id. This object is created and owned by the capture AST, and other AST and NFA states point to these tags.
Copy link

coderabbitai bot commented Dec 5, 2024

Walkthrough

The changes in this pull request involve significant updates to the CMake build configuration and modifications to the DFA-related classes in the log_surgeon::finite_automata namespace. The CMakeLists.txt file has been revised to manage new header files and dependencies more effectively. Key class definitions have been added, removed, or modified in the RegexDFA, RegexDFAState, and RegexDFAStatePair files, streamlining the handling of DFA states and their intersections.

Changes

File Path Change Summary
CMakeLists.txt - Removed src/log_surgeon/finite_automata/RegexDFA.tpp
- Added headers: RegexDFAState.hpp, RegexDFAStatePair.hpp, RegexDFAStateType.hpp
- Adjusted dependency handling for fmt and GSL using FetchContent
- Set CMAKE_EXPORT_COMPILE_COMMANDS to ON
- Defined default build type and shared library options.
src/log_surgeon/finite_automata/RegexDFA.hpp - Removed definitions for RegexDFAState and RegexDFAStatePair
- Updated new_state and get_intersect method implementations.
src/log_surgeon/finite_automata/RegexDFA.tpp - File deleted; contained implementations for DFA state transitions and management functions.
src/log_surgeon/finite_automata/RegexDFAState.hpp - Added class template RegexDFAState
- Introduced methods for managing variable IDs and state transitions.
src/log_surgeon/finite_automata/RegexDFAStatePair.hpp - Added class RegexDFAStatePair to represent pairs of DFA states for intersection construction.
- Included methods for state comparison and reachable pairs generation.
src/log_surgeon/finite_automata/RegexDFAStateType.hpp - Introduced enum class RegexDFAStateType with states Byte and UTF8.
examples/intersect-test.cpp - Updated get_intersect_for_query function to pass a pointer to dfa2 instead of the object directly when calling get_intersect.
src/log_surgeon/Lexer.hpp - Added include directive for RegexDFAState.hpp.

Possibly related PRs

Suggested reviewers

  • LinZhihao-723

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

35-35: Pass small integral type uint8_t by value instead of by const reference

The parameter byte in add_byte_transition is a uint8_t, which is more efficiently passed by value rather than by const reference. Passing small integral types by value avoids unnecessary indirection.

Apply this diff to pass byte by value:

-auto add_byte_transition(uint8_t const& byte, RegexDFAState<stateType>* dest_state) -> void {
+auto add_byte_transition(uint8_t byte, RegexDFAState<stateType>* dest_state) -> void {
src/log_surgeon/finite_automata/RegexDFA.hpp (1)

65-65: Use idiomatic condition in the while loop

Consider using !unvisited_pairs.empty() instead of false == unvisited_pairs.empty() for better readability and to align with common coding practices.

Apply this diff:

-    while (false == unvisited_pairs.empty()) {
+    while (!unvisited_pairs.empty()) {
src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

70-80: Offer assistance to implement UTF-8 transitions

The method get_reachable_pairs currently handles only single-byte transitions, as indicated by the TODO comment. Supporting UTF-8 (multi-byte) transitions is essential for full Unicode compatibility.

I can help implement support for UTF-8 transitions in this method. Would you like me to provide a solution or open a GitHub issue to track this enhancement?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 99b5b08 and 66ed13b.

📒 Files selected for processing (6)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFA.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegexDFA.tpp (0 hunks)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1 hunks)
💤 Files with no reviewable changes (1)
  • src/log_surgeon/finite_automata/RegexDFA.tpp
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp
🧰 Additional context used
📓 Learnings (3)
CMakeLists.txt (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
src/log_surgeon/finite_automata/RegexDFA.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
🔇 Additional comments (1)
CMakeLists.txt (1)

100-102: Addition of new source files is correct

The inclusion of the new source files RegexDFAState.hpp, RegexDFAStatePair.hpp, and RegexDFAStateType.hpp into the build system is appropriate. This ensures that the newly created classes are compiled and linked properly.

src/log_surgeon/finite_automata/RegexDFAState.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexDFA.hpp Outdated Show resolved Hide resolved
@SharafMohamed SharafMohamed changed the title refactor: Separate DFA functionality into different files. refactor: Extract RegexDFAState class, RegexDFAStatePair class, and RegexDFAStateType enum into their own files. Dec 6, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexDFA.hpp (2)

12-14: Consider addressing the TODO comment in this PR

The TODO suggests renaming RegexDFA to DFA. Since this PR is already refactoring the DFA-related code, it might be a good opportunity to implement this naming change as well.

Would you like me to help with renaming the class and updating all its references?


60-60: Track the UTF-8 handling requirement

The TODO comment indicates missing functionality for handling UTF-8 multi-byte transitions.

Would you like me to create a GitHub issue to track the UTF-8 handling requirement?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 66ed13b and 176391b.

📒 Files selected for processing (2)
  • src/log_surgeon/finite_automata/RegexDFA.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp
🧰 Additional context used
📓 Path-based instructions (1)
src/log_surgeon/finite_automata/RegexDFA.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/RegexDFA.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (3)
src/log_surgeon/finite_automata/RegexDFA.hpp (3)

27-34: LGTM! Parameter type changed as suggested

The method signature has been updated to use a raw pointer instead of a unique_ptr, which is more appropriate for this use case where ownership transfer isn't needed.


40-52: LGTM! Clean implementation of state creation

The implementation properly manages memory using smart pointers and correctly handles the transfer of matching variable IDs from NFA to DFA states.


54-72: LGTM! Well-structured intersection algorithm

The implementation:

  • Correctly manages state pair traversal
  • Properly collects matching variable IDs
  • Follows coding guidelines with false == unvisited_pairs.empty()

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

55-55: Optimise memory usage by conditionally including m_bytes_transition

The member m_bytes_transition is always included in RegexDFAState, even when stateType is UTF8, which can lead to unnecessary memory usage. Consider conditionally including m_bytes_transition only when stateType is Byte, similar to how m_tree_transitions is conditionally included.

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

71-71: Implement UTF-8 handling in get_reachable_pairs

There is a TODO comment indicating that UTF-8 (multi-byte transitions) are not yet handled in the get_reachable_pairs method. Implementing this functionality is important for full UTF-8 support. Would you like assistance in adding this implementation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 176391b and 0decaf5.

📒 Files selected for processing (4)
  • examples/intersect-test.cpp (1 hunks)
  • src/log_surgeon/Lexer.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (3)
examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/RegexDFAState.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

63-65: ⚠️ Potential issue

Add bounds check on character in Byte state

Accessing m_bytes_transition[character] without validating character may lead to out-of-bounds access if character is greater than or equal to cSizeOfByte. Consider adding an assertion to ensure character is within bounds to prevent potential errors.

Apply this diff to add the assertion:

 if constexpr (RegexDFAStateType::Byte == stateType) {
+    assert(character < cSizeOfByte);
     return m_bytes_transition[character];
 }
examples/intersect-test.cpp (1)

45-45: Function call to get_intersect updated correctly

The get_intersect function call has been correctly updated to pass a raw pointer using dfa2.get(), aligning with the updated function signature. This ensures proper functionality.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactoring looks good to me. Let's make an agreement on the macro naming and then we can merge

@@ -0,0 +1,80 @@
#ifndef LOG_SURGEON_FINITE_AUTOMATA_REGEX_DFA_STATE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to miss this in previous refactor PRs: I think we should name macros to exactly match the file name, so this header should be LOG_SURGEON_FINITE_AUTOMATA_REGEXDFASTATE instead. We can create an issue to keep track of this and fix them all together later

Copy link
Contributor Author

@SharafMohamed SharafMohamed Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk sounds good, I'll create the issue. I was previously separating it on capitalization, e.g. log_surgeon/finite_automate/DfaState would use #ifndef LOG_SURGEON_FINITE_AUTOMATA_DFA_STATE as the correct snake_case naming for the separate words (as we're combining snake_case folder names and camal_case file names).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title looks good to me.
The macro naming issue is tracked here: #65

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants