refactor: Move `epsilon_closure` and `nfa_to_dfa` from `Lexer` class to `Dfa` class. #62

SharafMohamed · 2024-12-05T17:30:49Z

References

Depends on PR#61.
To review in parallel with PR#61, diff against PR#61 locally. In the repo run:

git fetch upstream pull/61/head:pr-61
git fetch upstream pull/62/head:pr-62
git diff pr-61 pr-62

Description

Move epsilon_closure from Lexer class to NfaState class.
Move functionality from nfa_to_dfa in Lexer class to Dfa class constructor.

Validation performed

Previously existing tests succeed.

Summary by CodeRabbit

Release Notes

New Features
- Introduced new Dfa and Nfa classes for enhanced finite automata functionality.
- Added new source files for finite automata components.
Improvements
- Renamed parser-related classes and updated naming conventions for consistency.
- Enhanced type handling and clarity in method signatures across various components.
Bug Fixes
- Improved error handling in LogEvent and LogParser classes.
Documentation
- Updated comments and documentation for consistency with naming conventions.

…pp files; Also remove RegexDFA.tpp file.

…ector instead of set.

Co-authored-by: Lin Zhihao <[email protected]>

… one in case merging caused issues.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (15)

src/log_surgeon/finite_automata/DfaStatePair.hpp (5)
11-21: Consider enhancing the documentation.

While the current documentation is clear, consider adding:

Requirements/constraints for the template parameter TypedDfaState

Thread safety guarantees for the class

22-28: Consider adding safety checks and documentation.

The constructor could be improved by:

Adding nullptr validation for the state pointers

Documenting pointer ownership semantics

Adding noexcept specification if appropriate
 template <typename TypedDfaState>
 class DfaStatePair {
 public:
+    /**
+     * @brief Constructs a pair of DFA states
+     * @param state1 Non-null pointer to first DFA state (not owned)
+     * @param state2 Non-null pointer to second DFA state (not owned)
+     * @throws std::invalid_argument if either pointer is null
+     */
     DfaStatePair(TypedDfaState const* state1, TypedDfaState const* state2)
             : m_state1(state1),
-              m_state2(state2) {};
+              m_state2(state2) {
+        if (state1 == nullptr || state2 == nullptr) {
+            throw std::invalid_argument("DFA state pointers cannot be null");
+        }
+    }
42-64: Consider enhancing method documentation and specifications.

The member functions could benefit from:

More detailed documentation for get_reachable_pairs:

Complexity guarantees

Exception safety guarantees

noexcept specifications where appropriate for is_accepting and get_matching_variable_ids

71-71: Address the TODO comment regarding UTF-8 support.

The comment indicates missing UTF-8 support, which could lead to incorrect behaviour with multi-byte characters.

Would you like me to help create a GitHub issue to track the UTF-8 implementation task?

72-82: Consider performance optimizations for state pair generation.

The implementation could benefit from:

Documentation of time and space complexity

Pre-reserving space in the sets based on the expected number of transitions
 auto DfaStatePair<TypedDfaState>::get_reachable_pairs(
         std::set<DfaStatePair>& visited_pairs,
         std::set<DfaStatePair>& unvisited_pairs
 ) const -> void {
+    // Reserve space for potential new pairs (maximum cSizeOfByte new pairs)
+    unvisited_pairs.reserve(unvisited_pairs.size() + cSizeOfByte);
     // TODO: Handle UTF-8 (multi-byte transitions) as well
     for (uint32_t i = 0; i < cSizeOfByte; i++) {
src/log_surgeon/finite_automata/Nfa.hpp (2)
Line range hint 171-171: Prefer false == <expression> over !<expression> for clarity

According to the coding guidelines, it's preferred to use false == <expression> instead of !<expression>. Please update the following lines:

Line 171: while (!unused_item_sets.empty()) { should be while (false == unused_item_sets.empty()) {

Line 220: while (!q.empty()) { should be while (false == q.empty()) {

Line 383: while (!queue.empty()) { should be while (false == queue.empty()) {

Apply these diffs:
- while (!unused_item_sets.empty()) {
+ while (false == unused_item_sets.empty()) {
- while (!q.empty()) {
+ while (false == q.empty()) {
- while (!queue.empty()) {
+ while (false == queue.empty()) {
Also applies to: 220-220, 383-383

Line range hint 241-241: Use false == <expression> instead of !<expression>

At line 241, if (!item.has_dot_at_end()) { should be updated to if (false == item.has_dot_at_end()) { to adhere to the coding guidelines.

Apply this diff:
- if (!item.has_dot_at_end()) {
+ if (false == item.has_dot_at_end()) {
src/log_surgeon/Lalr1Parser.tpp (2)
110-110: Prefer false == <expression> over !<expression> for consistency

Please update the following lines to match the coding guidelines:

Line 110: while (!q.empty()) { → while (false == q.empty()) {

Line 273: while (changed) { → while (false == changed) {

Line 335: while (changed) { → while (false == changed) {

Line 383: while (!queue.empty()) { → while (false == queue.empty()) {

Apply these diffs:
- while (!q.empty()) {
+ while (false == q.empty()) {
- while (changed) {
+ while (false == changed) {
- while (!queue.empty()) {
+ while (false == queue.empty()) {
Also applies to: 273-273, 335-335, 383-383

571-571: Use false == <expression> instead of !<expression> for assertions

At line 571, please update the assertion to follow the coding guideline:

Change assert(!m_parse_stack_matches.empty()); to assert(false == m_parse_stack_matches.empty());

Apply this diff:
- assert(!m_parse_stack_matches.empty());
+ assert(false == m_parse_stack_matches.empty());
src/log_surgeon/LexicalRule.hpp (1)

28-30: Address TODO comment regarding constant pointer return type.

The TODO comment indicates a need to make the returned pointer constant. This would prevent modification of the underlying RegexAST through the getter.

Would you like me to propose a solution for making the returned pointer constant?
src/log_surgeon/Lexer.hpp (1)
14-16: Consider grouping related includes together.

The finite automata related includes could be grouped together for better organization:
-#include <log_surgeon/finite_automata/Dfa.hpp>
-#include <log_surgeon/finite_automata/DfaState.hpp>
-#include <log_surgeon/finite_automata/Nfa.hpp>
+// Finite automata related includes
+#include <log_surgeon/finite_automata/Dfa.hpp>
+#include <log_surgeon/finite_automata/DfaState.hpp>
+#include <log_surgeon/finite_automata/Nfa.hpp>
src/log_surgeon/finite_automata/Dfa.hpp (3)
45-60: Consider using std::unordered_map for better performance

The dfa_states map is used for state lookup operations. Since the order of states isn't important here, using std::unordered_map could provide better average-case performance.
-    std::map<StateSet, DfaStateType*> dfa_states;
+    std::unordered_map<StateSet, DfaStateType*> dfa_states;
131-131: Use !unvisited_pairs.empty() for consistency

The condition uses false == unvisited_pairs.empty() which is less common than the ! operator.
-    while (false == unvisited_pairs.empty()) {
+    while (!unvisited_pairs.empty()) {
73-82: TODO comments need attention

There are multiple TODO comments about UTF-8 support. This could impact the functionality for non-ASCII characters.

Would you like me to help create GitHub issues to track the UTF-8 implementation tasks?
src/log_surgeon/finite_automata/NfaState.hpp (1)

174-207: Consider optimizing epsilon closure computation

The current implementation uses a set for visited states and a stack for traversal. Consider using a vector with a boolean flag for visited states if the state IDs are dense, as this could provide better cache locality.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4d30509 and 80979de.

📒 Files selected for processing (27)

CMakeLists.txt (2 hunks)
examples/intersect-test.cpp (4 hunks)
src/log_surgeon/BufferParser.hpp (2 hunks)
src/log_surgeon/Lalr1Parser.cpp (2 hunks)
src/log_surgeon/Lalr1Parser.hpp (8 hunks)
src/log_surgeon/Lalr1Parser.tpp (28 hunks)
src/log_surgeon/Lexer.hpp (5 hunks)
src/log_surgeon/Lexer.tpp (18 hunks)
src/log_surgeon/LexicalRule.hpp (2 hunks)
src/log_surgeon/LogParser.cpp (6 hunks)
src/log_surgeon/LogParser.hpp (4 hunks)
src/log_surgeon/Parser.hpp (1 hunks)
src/log_surgeon/Parser.tpp (2 hunks)
src/log_surgeon/ReaderParser.hpp (2 hunks)
src/log_surgeon/SchemaParser.cpp (3 hunks)
src/log_surgeon/SchemaParser.hpp (4 hunks)
src/log_surgeon/finite_automata/Dfa.hpp (1 hunks)
src/log_surgeon/finite_automata/DfaState.hpp (1 hunks)
src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
src/log_surgeon/finite_automata/DfaStateType.hpp (1 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (29 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (5 hunks)
tests/CMakeLists.txt (1 hunks)
tests/test-NFA.cpp (1 hunks)
tests/test-lexer.cpp (2 hunks)

✅ Files skipped from review due to trivial changes (2)

src/log_surgeon/BufferParser.hpp
src/log_surgeon/ReaderParser.hpp

🚧 Files skipped from review as they are similar to previous changes (12)

src/log_surgeon/finite_automata/DfaStateType.hpp
src/log_surgeon/LogParser.hpp
src/log_surgeon/Parser.hpp
src/log_surgeon/Parser.tpp
tests/CMakeLists.txt
src/log_surgeon/SchemaParser.hpp
src/log_surgeon/LogParser.cpp
src/log_surgeon/finite_automata/TaggedTransition.hpp
tests/test-NFA.cpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/RegexAST.hpp
src/log_surgeon/Lexer.tpp

🧰 Additional context used

📓 Path-based instructions (11)

src/log_surgeon/Lalr1Parser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lalr1Parser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

src/log_surgeon/finite_automata/NfaState.hpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

🔇 Additional comments (16)

src/log_surgeon/finite_automata/DfaStatePair.hpp (2)

1-9: LGTM! Header guards and includes are well-organized.

The header guards follow the project's naming convention, and the includes are minimal and appropriate for the required functionality.

29-40: LGTM! Well-implemented comparison operator.

The implementation is efficient and correctly documented, with proper const correctness and clear return type specification.

src/log_surgeon/finite_automata/DfaState.hpp (1)

62-62: ⚠️ Potential issue

Add bounds checking in next method to prevent out-of-bounds access

In the next method, when state_type is Byte, accessing m_bytes_transition[character] without verifying that character is within bounds can cause out-of-bounds access if character is greater than or equal to cSizeOfByte. Please add a bounds check to ensure safe access.

Apply this diff to add bounds checking:

if constexpr (DfaStateType::Byte == state_type) {
+    if (character >= cSizeOfByte) {
+        return nullptr;
+    }
    return m_bytes_transition[character];
} else {

src/log_surgeon/Lalr1Parser.cpp (1)

11-13: LGTM

The initialization and member variable adjustments are correctly implemented.

src/log_surgeon/LexicalRule.hpp (1)

9-9: LGTM! Type renaming changes improve clarity.

The renaming from generic types (NFAStateType) to more descriptive types (TypedNfaState) enhances code readability and better reflects the purpose of these types.

Also applies to: 15-15, 24-24, 28-28, 35-35, 38-39

examples/intersect-test.cpp (2)

43-44: Improved DFA construction using constructor.

The change from using ByteLexer::nfa_to_dfa(nfa) to Dfa dfa2(nfa) is a good improvement as it:

Simplifies the code by using constructor instead of static method
Makes the relationship between NFA and DFA more explicit

81-82: Verify test coverage after refactoring.

The DFA construction changes are significant. Although the test cases remain unchanged, we should verify that they still provide adequate coverage.

src/log_surgeon/Lexer.hpp (1)

162-163: LGTM! Type aliases follow new naming convention.

The type aliases in the lexers namespace have been updated to use the new type names, maintaining consistency across the codebase.

CMakeLists.txt (1)

99-105: LGTM! Source files properly organized

The new finite automata source files are correctly added and organized within the existing structure.

src/log_surgeon/finite_automata/NfaState.hpp (1)

37-38: Constructor parameter type matches previous feedback

The constructor now uses std::vector<Tag const*> instead of std::set as per the learning from PR #48.

src/log_surgeon/Lalr1Parser.hpp (4)

123-123: LGTM! Clean and concise implementation.

The removal of the this-> qualifier improves readability while maintaining the same functionality.

161-161: LGTM! Clean and concise implementation.

The removal of the this-> qualifier improves readability while maintaining the same functionality.

168-168: LGTM! Clean and concise implementation.

The single-line implementation improves readability while maintaining the same functionality.

203-204: LGTM! Improved type naming and proper base class member access.

The changes enhance code clarity through:

More descriptive template parameter names
Proper base class member access using the using declaration

Also applies to: 206-206, 215-215, 225-225, 277-277, 397-398, 410-410

src/log_surgeon/SchemaParser.cpp (2)

419-419: LGTM! Improved C++ style consistency.

The change to trailing return type syntax aligns with modern C++ conventions.

484-484: LGTM! Improved C++ style consistency.

The change to trailing return type syntax aligns with modern C++ conventions.

src/log_surgeon/Lexer.hpp

src/log_surgeon/finite_automata/NfaState.hpp

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

examples/intersect-test.cpp (1)
24-24: Consider adding const qualifier to the DFA parameter

Since get_intersect_for_query only calls const member function get_intersect, consider making the parameter const:
-        Dfa<ByteDfaState>& dfa1,
+        Dfa<ByteDfaState> const& dfa1,

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b38b6d7 and 538ae8b.

📒 Files selected for processing (1)

examples/intersect-test.cpp (4 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (3)

examples/intersect-test.cpp (3)

8-11: LGTM! Type declarations are clear and consistent

The new type names and namespace organization improve code clarity and maintainability.

Also applies to: 20-20

43-44: LGTM! Efficient construction with move semantics

The construction of NFA and DFA objects properly utilizes move semantics, and the new approach of constructing DFA directly from NFA is more elegant.

81-82: LGTM! Verify test coverage remains unchanged

The refactored construction is consistent with the changes above. Let's verify that all test cases still exercise the same functionality.

✅ Verification successful

All test cases are preserved and type refactoring is complete

The verification confirms:

All test cases using get_intersect_for_query are intact, maintaining the same test coverage
No legacy type names remain in the codebase, confirming successful refactoring

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify test coverage and functionality

# Check if all test cases are preserved
rg -A 1 "get_intersect_for_query.*dfa" examples/intersect-test.cpp

# Verify no old type names remain
rg -l "RegexDFA|RegexNFA|RegexDFAByteState|RegexNFAByteState" || echo "No old type names found"

Length of output: 716

SharafMohamed and others added 30 commits November 7, 2024 03:07

Add nullptr checks.

59cc6cd

Merge branch 'meaningful-tags' into fixed-tagged-nfa

8097a69

Change Tag class functionality to reflect how registers will be used.

9fc41c0

Add register class.

6e5c968

Seperate classes from RegexDFA.hpp and RegexDFA.tpp into their own .h…

e185fe2

…pp files; Also remove RegexDFA.tpp file.

Cleanup NFA and DFA classes.

54357af

Change RegexDFA to Dfa.

2379280

Change REGEX_DFA to DFA.

8fe9d1e

Change DFAStateType to DfaStateType.

a768ae3

RegexNFA -> Nfa.

9fcd7cc

UTF8 -> Utf8.

b112a25

Remove unneeded this->.

653038f

Use auto where possible.

815ac73

Remove nonterminal namespace inside of nonterminal.

cc532c8

Merge branch 'remove-redundant-typenames' into remove-regex-prefix

1a1655a

Merge branch 'remove-regex-prefix' into remove-uneeded-this

dae03e5

Merge branch 'remove-uneeded-this' into use-auto

76ef76f

Use auto where possible for all = operators.

70362d1

Change positive end transitions from a vector to an optional scalar.

039eec9

Temp fix for unit-test until future PR where Tag ptrs are stored in v…

d060bc6

…ector instead of set.

Swap from set to vector to tag pointers to ensure determinism.

f041a37

Better test coverage for tag class.

f72e120

Use constant iterators for elements that should not change.

d5ac1ad

Co-authored-by: Lin Zhihao <[email protected]>

Use braced intiailization in test-tag.cpp.

30f03ed

Use const& for insertion function that can't use move semantics.

d386fc0

Co-authored-by: Lin Zhihao <[email protected]>

Refactor nfa_to_dfa().

d77d17d

Have get_name() return string_view; Update headers.

4024c3e

Remove const from member variable.

22c3b82

Co-authored-by: Lin Zhihao <[email protected]>

Remove const from member variable.

ed55534

Co-authored-by: Lin Zhihao <[email protected]>

Run linter.

534afce

SharafMohamed added 20 commits December 8, 2024 03:16

Merge branch 'remove-redundant-typenames' into remove-regex-prefix

cf5980d

LALR1Parser to Lalr1Parser.

6e65a3e

Linter.

9c2ad81

Rename templates to TypedDfaState and TypedNfaState.

c4fc96b

Rename to Utf8*State and Byte*State.

a6bbaef

Remove RegexDFAStateType.hpp.

ff2dac3

Linter.

7a8982d

Linter again.

7b0a86c

Add missing alogrithm header; Update test-NFA.

2632616

Update intersect-test.cpp with new names.

992a2ec

Linter on intersect-test.cpp.

a5419f0

Merge branch 'remove-regex-prefix' into remove-uneeded-this

ae64f64

Switch this->m_lexer to m_lexer by using Parser::m_lexer.

9153a7c

Linter.

fa0c098

Merge branch 'remove-uneeded-this' into use-auto

845b088

Lint.

2dd6f45

More auto changes.

ee46719

Merge branch 'use-auto' into refactor-nfa-to-dfa

4cebc06

Merge branch 'main' into remove-redundant-typenames

827d39c

Remove old epsilon_closure after double checking it matches the moved…

80979de

… one in case merging caused issues.

coderabbitai bot reviewed Dec 11, 2024

View reviewed changes

src/log_surgeon/Lexer.hpp Outdated Show resolved Hide resolved

src/log_surgeon/finite_automata/NfaState.hpp Show resolved Hide resolved

Merge dfa_to_nfa() into Dfa::Dfa().

b38b6d7

SharafMohamed requested a review from LinZhihao-723 December 11, 2024 20:44

SharafMohamed added 5 commits December 11, 2024 15:46

Merge branch 'remove-redundant-typenames' into remove-regex-prefix

1c2d7b3

Merge branch 'remove-regex-prefix' into remove-uneeded-this

11da5d5

Merge branch 'remove-uneeded-this' into use-auto

3f6966f

Merge branch 'use-auto' into refactor-nfa-to-dfa

11eab2b

Fix intersect_test compiler errrors.

538ae8b

coderabbitai bot reviewed Dec 11, 2024

View reviewed changes

Add const.

74841b0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Move `epsilon_closure` and `nfa_to_dfa` from `Lexer` class to `Dfa` class. #62

refactor: Move `epsilon_closure` and `nfa_to_dfa` from `Lexer` class to `Dfa` class. #62

SharafMohamed commented Dec 5, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

refactor: Move epsilon_closure and nfa_to_dfa from Lexer class to Dfa class. #62

Are you sure you want to change the base?

refactor: Move epsilon_closure and nfa_to_dfa from Lexer class to Dfa class. #62

Conversation

SharafMohamed commented Dec 5, 2024 • edited by coderabbitai bot Loading

References

Description

Validation performed

Summary by CodeRabbit

Release Notes

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

refactor: Move `epsilon_closure` and `nfa_to_dfa` from `Lexer` class to `Dfa` class. #62

refactor: Move `epsilon_closure` and `nfa_to_dfa` from `Lexer` class to `Dfa` class. #62

SharafMohamed commented Dec 5, 2024 •

edited by coderabbitai bot

Loading