Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Move epsilon_closure and nfa_to_dfa from Lexer class to Dfa class. #62

Open
wants to merge 442 commits into
base: main
Choose a base branch
from

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Dec 5, 2024

References

  • Depends on PR#61.
  • To review in parallel with PR#61, diff against PR#61 locally. In the repo run:
git fetch upstream pull/61/head:pr-61
git fetch upstream pull/62/head:pr-62
git diff pr-61 pr-62

Description

  • Move epsilon_closure from Lexer class to NfaState class.
  • Move functionality from nfa_to_dfa in Lexer class to Dfa class constructor.

Validation performed

Previously existing tests succeed.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new Dfa and Nfa classes for enhanced finite automata functionality.
    • Added new source files for finite automata components.
  • Improvements

    • Renamed parser-related classes and updated naming conventions for consistency.
    • Enhanced type handling and clarity in method signatures across various components.
  • Bug Fixes

    • Improved error handling in LogEvent and LogParser classes.
  • Documentation

    • Updated comments and documentation for consistency with naming conventions.

SharafMohamed and others added 30 commits November 7, 2024 03:07
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (15)
src/log_surgeon/finite_automata/DfaStatePair.hpp (5)

11-21: Consider enhancing the documentation.

While the current documentation is clear, consider adding:

  • Requirements/constraints for the template parameter TypedDfaState
  • Thread safety guarantees for the class

22-28: Consider adding safety checks and documentation.

The constructor could be improved by:

  1. Adding nullptr validation for the state pointers
  2. Documenting pointer ownership semantics
  3. Adding noexcept specification if appropriate
 template <typename TypedDfaState>
 class DfaStatePair {
 public:
+    /**
+     * @brief Constructs a pair of DFA states
+     * @param state1 Non-null pointer to first DFA state (not owned)
+     * @param state2 Non-null pointer to second DFA state (not owned)
+     * @throws std::invalid_argument if either pointer is null
+     */
     DfaStatePair(TypedDfaState const* state1, TypedDfaState const* state2)
             : m_state1(state1),
-              m_state2(state2) {};
+              m_state2(state2) {
+        if (state1 == nullptr || state2 == nullptr) {
+            throw std::invalid_argument("DFA state pointers cannot be null");
+        }
+    }

42-64: Consider enhancing method documentation and specifications.

The member functions could benefit from:

  1. More detailed documentation for get_reachable_pairs:
    • Complexity guarantees
    • Exception safety guarantees
  2. noexcept specifications where appropriate for is_accepting and get_matching_variable_ids

71-71: Address the TODO comment regarding UTF-8 support.

The comment indicates missing UTF-8 support, which could lead to incorrect behaviour with multi-byte characters.

Would you like me to help create a GitHub issue to track the UTF-8 implementation task?


72-82: Consider performance optimizations for state pair generation.

The implementation could benefit from:

  1. Documentation of time and space complexity
  2. Pre-reserving space in the sets based on the expected number of transitions
 auto DfaStatePair<TypedDfaState>::get_reachable_pairs(
         std::set<DfaStatePair>& visited_pairs,
         std::set<DfaStatePair>& unvisited_pairs
 ) const -> void {
+    // Reserve space for potential new pairs (maximum cSizeOfByte new pairs)
+    unvisited_pairs.reserve(unvisited_pairs.size() + cSizeOfByte);
     // TODO: Handle UTF-8 (multi-byte transitions) as well
     for (uint32_t i = 0; i < cSizeOfByte; i++) {
src/log_surgeon/finite_automata/Nfa.hpp (2)

Line range hint 171-171: Prefer false == <expression> over !<expression> for clarity

According to the coding guidelines, it's preferred to use false == <expression> instead of !<expression>. Please update the following lines:

  • Line 171: while (!unused_item_sets.empty()) { should be while (false == unused_item_sets.empty()) {
  • Line 220: while (!q.empty()) { should be while (false == q.empty()) {
  • Line 383: while (!queue.empty()) { should be while (false == queue.empty()) {

Apply these diffs:

- while (!unused_item_sets.empty()) {
+ while (false == unused_item_sets.empty()) {
- while (!q.empty()) {
+ while (false == q.empty()) {
- while (!queue.empty()) {
+ while (false == queue.empty()) {

Also applies to: 220-220, 383-383


Line range hint 241-241: Use false == <expression> instead of !<expression>

At line 241, if (!item.has_dot_at_end()) { should be updated to if (false == item.has_dot_at_end()) { to adhere to the coding guidelines.

Apply this diff:

- if (!item.has_dot_at_end()) {
+ if (false == item.has_dot_at_end()) {
src/log_surgeon/Lalr1Parser.tpp (2)

110-110: Prefer false == <expression> over !<expression> for consistency

Please update the following lines to match the coding guidelines:

  • Line 110: while (!q.empty()) {while (false == q.empty()) {
  • Line 273: while (changed) {while (false == changed) {
  • Line 335: while (changed) {while (false == changed) {
  • Line 383: while (!queue.empty()) {while (false == queue.empty()) {

Apply these diffs:

- while (!q.empty()) {
+ while (false == q.empty()) {
- while (changed) {
+ while (false == changed) {
- while (!queue.empty()) {
+ while (false == queue.empty()) {

Also applies to: 273-273, 335-335, 383-383


571-571: Use false == <expression> instead of !<expression> for assertions

At line 571, please update the assertion to follow the coding guideline:

  • Change assert(!m_parse_stack_matches.empty()); to assert(false == m_parse_stack_matches.empty());

Apply this diff:

- assert(!m_parse_stack_matches.empty());
+ assert(false == m_parse_stack_matches.empty());
src/log_surgeon/LexicalRule.hpp (1)

28-30: Address TODO comment regarding constant pointer return type.

The TODO comment indicates a need to make the returned pointer constant. This would prevent modification of the underlying RegexAST through the getter.

Would you like me to propose a solution for making the returned pointer constant?

src/log_surgeon/Lexer.hpp (1)

14-16: Consider grouping related includes together.

The finite automata related includes could be grouped together for better organization:

-#include <log_surgeon/finite_automata/Dfa.hpp>
-#include <log_surgeon/finite_automata/DfaState.hpp>
-#include <log_surgeon/finite_automata/Nfa.hpp>
+// Finite automata related includes
+#include <log_surgeon/finite_automata/Dfa.hpp>
+#include <log_surgeon/finite_automata/DfaState.hpp>
+#include <log_surgeon/finite_automata/Nfa.hpp>
src/log_surgeon/finite_automata/Dfa.hpp (3)

45-60: Consider using std::unordered_map for better performance

The dfa_states map is used for state lookup operations. Since the order of states isn't important here, using std::unordered_map could provide better average-case performance.

-    std::map<StateSet, DfaStateType*> dfa_states;
+    std::unordered_map<StateSet, DfaStateType*> dfa_states;

131-131: Use !unvisited_pairs.empty() for consistency

The condition uses false == unvisited_pairs.empty() which is less common than the ! operator.

-    while (false == unvisited_pairs.empty()) {
+    while (!unvisited_pairs.empty()) {

73-82: TODO comments need attention

There are multiple TODO comments about UTF-8 support. This could impact the functionality for non-ASCII characters.

Would you like me to help create GitHub issues to track the UTF-8 implementation tasks?

src/log_surgeon/finite_automata/NfaState.hpp (1)

174-207: Consider optimizing epsilon closure computation

The current implementation uses a set for visited states and a stack for traversal. Consider using a vector with a boolean flag for visited states if the state IDs are dense, as this could provide better cache locality.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4d30509 and 80979de.

📒 Files selected for processing (27)
  • CMakeLists.txt (2 hunks)
  • examples/intersect-test.cpp (4 hunks)
  • src/log_surgeon/BufferParser.hpp (2 hunks)
  • src/log_surgeon/Lalr1Parser.cpp (2 hunks)
  • src/log_surgeon/Lalr1Parser.hpp (8 hunks)
  • src/log_surgeon/Lalr1Parser.tpp (28 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/Lexer.tpp (18 hunks)
  • src/log_surgeon/LexicalRule.hpp (2 hunks)
  • src/log_surgeon/LogParser.cpp (6 hunks)
  • src/log_surgeon/LogParser.hpp (4 hunks)
  • src/log_surgeon/Parser.hpp (1 hunks)
  • src/log_surgeon/Parser.tpp (2 hunks)
  • src/log_surgeon/ReaderParser.hpp (2 hunks)
  • src/log_surgeon/SchemaParser.cpp (3 hunks)
  • src/log_surgeon/SchemaParser.hpp (4 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (29 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (5 hunks)
  • tests/CMakeLists.txt (1 hunks)
  • tests/test-NFA.cpp (1 hunks)
  • tests/test-lexer.cpp (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • src/log_surgeon/BufferParser.hpp
  • src/log_surgeon/ReaderParser.hpp
🚧 Files skipped from review as they are similar to previous changes (12)
  • src/log_surgeon/finite_automata/DfaStateType.hpp
  • src/log_surgeon/LogParser.hpp
  • src/log_surgeon/Parser.hpp
  • src/log_surgeon/Parser.tpp
  • tests/CMakeLists.txt
  • src/log_surgeon/SchemaParser.hpp
  • src/log_surgeon/LogParser.cpp
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
  • tests/test-NFA.cpp
  • tests/test-lexer.cpp
  • src/log_surgeon/finite_automata/RegexAST.hpp
  • src/log_surgeon/Lexer.tpp
🧰 Additional context used
📓 Path-based instructions (11)
src/log_surgeon/Lalr1Parser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lalr1Parser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (16)
src/log_surgeon/finite_automata/DfaStatePair.hpp (2)

1-9: LGTM! Header guards and includes are well-organized.

The header guards follow the project's naming convention, and the includes are minimal and appropriate for the required functionality.


29-40: LGTM! Well-implemented comparison operator.

The implementation is efficient and correctly documented, with proper const correctness and clear return type specification.

src/log_surgeon/finite_automata/DfaState.hpp (1)

62-62: ⚠️ Potential issue

Add bounds checking in next method to prevent out-of-bounds access

In the next method, when state_type is Byte, accessing m_bytes_transition[character] without verifying that character is within bounds can cause out-of-bounds access if character is greater than or equal to cSizeOfByte. Please add a bounds check to ensure safe access.

Apply this diff to add bounds checking:

if constexpr (DfaStateType::Byte == state_type) {
+    if (character >= cSizeOfByte) {
+        return nullptr;
+    }
    return m_bytes_transition[character];
} else {
src/log_surgeon/Lalr1Parser.cpp (1)

11-13: LGTM

The initialization and member variable adjustments are correctly implemented.

src/log_surgeon/LexicalRule.hpp (1)

9-9: LGTM! Type renaming changes improve clarity.

The renaming from generic types (NFAStateType) to more descriptive types (TypedNfaState) enhances code readability and better reflects the purpose of these types.

Also applies to: 15-15, 24-24, 28-28, 35-35, 38-39

examples/intersect-test.cpp (2)

43-44: Improved DFA construction using constructor.

The change from using ByteLexer::nfa_to_dfa(nfa) to Dfa dfa2(nfa) is a good improvement as it:

  1. Simplifies the code by using constructor instead of static method
  2. Makes the relationship between NFA and DFA more explicit

81-82: Verify test coverage after refactoring.

The DFA construction changes are significant. Although the test cases remain unchanged, we should verify that they still provide adequate coverage.

src/log_surgeon/Lexer.hpp (1)

162-163: LGTM! Type aliases follow new naming convention.

The type aliases in the lexers namespace have been updated to use the new type names, maintaining consistency across the codebase.

CMakeLists.txt (1)

99-105: LGTM! Source files properly organized

The new finite automata source files are correctly added and organized within the existing structure.

src/log_surgeon/finite_automata/NfaState.hpp (1)

37-38: Constructor parameter type matches previous feedback

The constructor now uses std::vector<Tag const*> instead of std::set as per the learning from PR #48.

src/log_surgeon/Lalr1Parser.hpp (4)

123-123: LGTM! Clean and concise implementation.

The removal of the this-> qualifier improves readability while maintaining the same functionality.


161-161: LGTM! Clean and concise implementation.

The removal of the this-> qualifier improves readability while maintaining the same functionality.


168-168: LGTM! Clean and concise implementation.

The single-line implementation improves readability while maintaining the same functionality.


203-204: LGTM! Improved type naming and proper base class member access.

The changes enhance code clarity through:

  1. More descriptive template parameter names
  2. Proper base class member access using the using declaration

Also applies to: 206-206, 215-215, 225-225, 277-277, 397-398, 410-410

src/log_surgeon/SchemaParser.cpp (2)

419-419: LGTM! Improved C++ style consistency.

The change to trailing return type syntax aligns with modern C++ conventions.


484-484: LGTM! Improved C++ style consistency.

The change to trailing return type syntax aligns with modern C++ conventions.

src/log_surgeon/Lexer.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/NfaState.hpp Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
examples/intersect-test.cpp (1)

24-24: Consider adding const qualifier to the DFA parameter

Since get_intersect_for_query only calls const member function get_intersect, consider making the parameter const:

-        Dfa<ByteDfaState>& dfa1,
+        Dfa<ByteDfaState> const& dfa1,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b38b6d7 and 538ae8b.

📒 Files selected for processing (1)
  • examples/intersect-test.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (3)
examples/intersect-test.cpp (3)

8-11: LGTM! Type declarations are clear and consistent

The new type names and namespace organization improve code clarity and maintainability.

Also applies to: 20-20


43-44: LGTM! Efficient construction with move semantics

The construction of NFA and DFA objects properly utilizes move semantics, and the new approach of constructing DFA directly from NFA is more elegant.


81-82: LGTM! Verify test coverage remains unchanged

The refactored construction is consistent with the changes above. Let's verify that all test cases still exercise the same functionality.

✅ Verification successful

All test cases are preserved and type refactoring is complete

The verification confirms:

  • All test cases using get_intersect_for_query are intact, maintaining the same test coverage
  • No legacy type names remain in the codebase, confirming successful refactoring
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify test coverage and functionality

# Check if all test cases are preserved
rg -A 1 "get_intersect_for_query.*dfa" examples/intersect-test.cpp

# Verify no old type names remain
rg -l "RegexDFA|RegexNFA|RegexDFAByteState|RegexNFAByteState" || echo "No old type names found"

Length of output: 716

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant