feat: Separate concept tags from captures and store capture-to-tag mapping in the lexer. #72

Merged
merged 581 commits into from
Feb 13, 2025

Conversation

SharafMohamed
Contributor

@SharafMohamed SharafMohamed commented Jan 13, 2025

References

Description

  • Previously, the term "tag" referred both to a single capture group and to the start and end markers for a capture group's position in the NFA.
    • The former is now referred to as a capture.
    • The latter is now stored as a unique unsigned integer.
  • To simplify information tracking and ownership transfer, the lexer is now responsible for keeping track of all the relational information it will need after parsing. This includes:
    • A map from each variable ID to the capture IDs of the groups the variable contains.
    • A map from each capture to its start and end tag.
    • A map from each tag to its final register.
  • Fix file order in CMake.
  • Use *_id_t aliases to make the intended purpose of each map clearer.
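
To make the relationships above concrete, here is a minimal sketch of the three maps and of tags as unique unsigned integers. It is not the code from this PR: the class names `IdGenerator` and `CaptureBookkeeping`, the method names, and the concrete alias definitions are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical aliases in the spirit of the `*_id_t` aliases described above.
using variable_id_t = std::uint32_t;
using capture_id_t = std::uint32_t;
using tag_id_t = std::uint32_t;
using reg_id_t = std::uint32_t;

// Hands out consecutive unsigned integers so that every tag gets a unique ID.
class IdGenerator {
public:
    auto next() -> std::uint32_t { return m_next_id++; }

private:
    std::uint32_t m_next_id{0};
};

// Hypothetical stand-in for the relational data the lexer keeps after parsing.
class CaptureBookkeeping {
public:
    // Records a capture under a variable and assigns it fresh start and end tags.
    auto add_capture(variable_id_t var_id, capture_id_t capture_id) -> void {
        assert(0 == m_capture_to_tags.count(capture_id) && "capture already registered");
        tag_id_t const start_tag{m_tag_ids.next()};
        tag_id_t const end_tag{m_tag_ids.next()};
        m_var_to_captures[var_id].push_back(capture_id);
        m_capture_to_tags.emplace(capture_id, std::make_pair(start_tag, end_tag));
    }

    // Records the final register that holds a tag's position.
    auto set_tag_register(tag_id_t tag_id, reg_id_t reg_id) -> void {
        m_tag_to_register[tag_id] = reg_id;
    }

    // The lookups use `.at()`, so an unknown ID throws `std::out_of_range`.
    auto get_capture_ids(variable_id_t var_id) const -> std::vector<capture_id_t> const& {
        return m_var_to_captures.at(var_id);
    }

    auto get_tag_ids(capture_id_t capture_id) const -> std::pair<tag_id_t, tag_id_t> const& {
        return m_capture_to_tags.at(capture_id);
    }

    // Follows capture -> (start tag, end tag) -> (start register, end register).
    auto get_registers(capture_id_t capture_id) const -> std::pair<reg_id_t, reg_id_t> {
        auto const& tags = get_tag_ids(capture_id);
        return {m_tag_to_register.at(tags.first), m_tag_to_register.at(tags.second)};
    }

private:
    IdGenerator m_tag_ids;
    std::map<variable_id_t, std::vector<capture_id_t>> m_var_to_captures;
    std::map<capture_id_t, std::pair<tag_id_t, tag_id_t>> m_capture_to_tags;
    std::map<tag_id_t, reg_id_t> m_tag_to_register;
};
```

In this sketch a single owner holds all three maps, so resolving a variable's capture positions is a chain of lookups and no ownership needs to move between components after parsing.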

Validation performed

  • Added a new unit test for the lexer's base functionality.
  • Added a new unit test for the lexer's capture group functionality. This includes testing the maps that can currently be assigned.

Summary by CodeRabbit

  • New Features

    • Enhanced regular expression processing and capture handling for improved error reporting and consistent token management.
    • Upgraded lexical analysis to provide more robust recognition and assignment of tokens, ensuring smoother operation.
    • Introduced a mechanism for generating unique identifiers and clear type definitions, bolstering overall system reliability.
    • Added new methods to the Lexer class for improved capture and tag management.
    • Introduced a new Capture class to replace the previous Tag class, streamlining capture handling.
  • Tests

    • Expanded test coverage to validate the new capture and lexing improvements, ensuring higher quality and stability.
    • Introduced unit tests for the new Capture class and enhanced tests for the lexer.
  • Chores

    • Streamlined build configuration and source management for improved development efficiency.

SharafMohamed and others added 30 commits December 2, 2024 21:14
…g_id; Remove error checking in favor of using .at().
…ative position test as this is what is seen in practice when using negative positions.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)

118-118: Consider using empty() instead of comparing with false.

The condition false == m_negative_captures.empty() can be simplified to !m_negative_captures.empty() for better readability.

-        if (false == m_negative_captures.empty()) {
+        if (!m_negative_captures.empty()) {

910-940: Update diagram comments to reflect capture-based terminology.

The diagram's comments still use tag-based terminology in some places. Consider updating them to consistently use capture-based terminology.

-    // The NFA constructed for a capture group follows the structure below, with tagged transitions
+    // The NFA constructed for a capture group follows the structure below, with capture transitions
-    //                    | (positive tagged start transition)
+    //                    | (positive capture start transition)
-    //                    | (negative tagged transition)
+    //                    | (negative capture transition)
-    //                    | (positive tagged end transition)
+    //                    | (positive capture end transition)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5e2e53 and ddae224.

📒 Files selected for processing (3)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
  • src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/RegexAST.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexAST.hpp (2)

23-23: LGTM! Documentation and includes updated to reflect the new capture-based system.

The changes consistently replace tag-based terminology with capture-based terminology throughout the documentation and includes.

Also applies to: 32-38


728-728: LGTM! Consistent use of capture-based serialization across all derived classes.

The changes to use serialize_negative_captures are applied consistently throughout all derived classes.

Also applies to: 745-745, 776-776, 807-807, 840-840, 903-903, 962-962, 1118-1118

Member

@LinZhihao-723 LinZhihao-723 left a comment

Last few comments and we're close to merge.

Member

@LinZhihao-723 LinZhihao-723 left a comment

For the PR title, how about:

feat: Separate concept tags from captures and store capture-to-tag mapping in the lexer.

@SharafMohamed SharafMohamed changed the title from "feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers." to "feat: Separate concept tags from captures and store capture-to-tag mapping in the lexer." Feb 13, 2025
@SharafMohamed SharafMohamed merged commit 6904e51 into y-scope:main Feb 13, 2025
8 checks passed