Draft: Optimization for GLT search #252

haiqi96 · 2024-01-26T22:50:35Z

TBD

Summary by CodeRabbit

Release Notes

New Features
- Enhanced query processing with improved handling of variable tokens and boundaries.
- Introduced a method to generate human-readable representations of log type entries.
- Added a new class for managing variable boundaries within queries.
Bug Fixes
- Refined error handling for search operations and file reading, improving robustness.
Improvements
- Optimized search functionality for better performance and reliability.
- Updated methods to streamline variable matching and log type boundary management.

coderabbitai · 2024-11-13T12:10:29Z

Walkthrough

The changes in this pull request involve multiple enhancements to the Grep, LogTypeDictionaryEntry, and Query classes, focusing on query token processing, boundary calculations, and log type handling. New methods for finding boundaries and retokenization are introduced, along with a new QueryBoundary class to manage variable boundaries. The search functionality is optimized, and error handling is improved across various components. These modifications collectively enhance the robustness and efficiency of the query processing system.

Changes

File Path	Change Summary
`components/core/src/glt/Grep.cpp`	Added methods: `find_boundaries`, `retokenization`, `get_union_of_bounds`. Updated `generate_logtypes_and_vars_for_subquery` logic.
`components/core/src/glt/Grep.hpp`	Added static method: `get_union_of_bounds`.
`components/core/src/glt/LogTypeDictionaryEntry.cpp`	Added method: `get_human_readable_value` for constructing a human-readable log type entry.
`components/core/src/glt/LogTypeDictionaryEntry.hpp`	Added public method: `get_human_readable_value`.
`components/core/src/glt/Query.cpp`	Added method: `set_logtype_boundary` in `SubQuery`. Updated `clear` and `matches_vars` methods for handling logtype boundaries.
`components/core/src/glt/Query.hpp`	Introduced new class: `QueryBoundary`. Added methods: `get_boundary_by_logtype_id`, `set_logtype_boundary`. Updated `LogtypeQuery` constructor.
`components/core/src/glt/glt/search.cpp`	Updated `search_segments` to use `Grep::search_segment_optimized_and_output`. Enhanced error handling in `open_archive` and `search` functions.
`components/core/src/glt/streaming_archive/reader/Archive.cpp`	Modified `find_message_matching_with_logtype_query_optimized` to optimize row retrieval. Enhanced error handling in `load_filename_dict`.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Query
    participant Grep
    participant LogTypeDictionaryEntry
    participant Archive

    User->>Query: Initiate search
    Query->>Grep: Generate log types and variables
    Grep->>Grep: Find boundaries
    Grep->>Grep: Retokenization
    Grep->>LogTypeDictionaryEntry: Get human-readable value
    Query->>Archive: Load data
    Archive->>Query: Return data
    Query->>User: Display results

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

kirkrodrigues · 2024-11-13T12:13:14Z

Labelled on hold until we have resources to revisit this. Possible actions items are:

Describe the change (logic + testing + benchmark results)
Add any more changes based on new insights

coderabbitai

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (8)

components/core/src/glt/Grep.hpp (2)
216-220: Add documentation for the new method.

The new method lacks essential documentation. Please add a documentation block that explains:

The purpose of the method

The expected format and constraints of the input sub_queries

The meaning and valid ranges of var_begin_ix and var_end_ix

Any error conditions or edge cases that callers should be aware of

Consider this format:
/**
 * Computes the union of variable bounds across multiple sub-queries
 * @param sub_queries Vector of LogtypeQuery objects to process
 * @param var_begin_ix [out] The resulting start index of the union
 * @param var_end_ix [out] The resulting end index of the union
 * @throw std::invalid_argument if sub_queries is empty [if applicable]
 */
216-220: Consider a more specific method name.

The current name get_union_of_bounds is somewhat vague. Consider a more specific name that indicates what type of bounds are being unified, such as get_union_of_variable_bounds or merge_query_variable_ranges.
components/core/src/glt/LogTypeDictionaryEntry.cpp (1)
206-233: LGTM! Consider enhancing documentation and readability.

The implementation is logically sound and correctly handles both placeholders and constants. However, a few improvements could make it more maintainable:

Consider adding documentation to explain the format of the human-readable output

The single-character representations ('v', 'f', 'i') for placeholder types could be made clearer using constants or an enum

Consider handling escape sequences in the output for better readability

Here's a suggested improvement:
+    // Constants for placeholder type representations
+    static const char DICT_PLACEHOLDER = 'v';
+    static const char FLOAT_PLACEHOLDER = 'f';
+    static const char INT_PLACEHOLDER = 'i';
+
     string human_readable_value;
 
     size_t constant_begin_pos = 0;
     for (size_t placeholder_ix = 0; placeholder_ix < get_num_placeholders(); ++placeholder_ix) {
         VariablePlaceholder placeholder;
         size_t placeholder_pos = get_placeholder_info(placeholder_ix, placeholder);
 
         // Add the constant that's between the last variable and this one, with newlines escaped
         human_readable_value
                 .append(m_value, constant_begin_pos, placeholder_pos - constant_begin_pos);
 
         if (VariablePlaceholder::Dictionary == placeholder) {
-            human_readable_value += "v";
+            human_readable_value += DICT_PLACEHOLDER;
         } else if (VariablePlaceholder::Float == placeholder) {
-            human_readable_value += "f";
+            human_readable_value += FLOAT_PLACEHOLDER;
         } else if (VariablePlaceholder::Integer == placeholder) {
-            human_readable_value += "i";
+            human_readable_value += INT_PLACEHOLDER;
         }
Also, consider adding documentation in the header file:
/**
 * @brief Generates a human-readable representation of the log type entry
 * @return A string where placeholders are represented as:
 *         - 'v' for Dictionary variables
 *         - 'f' for Float variables
 *         - 'i' for Integer variables
 *         Constants between placeholders are preserved as-is
 */
string get_human_readable_value() const;
components/core/src/glt/Grep.cpp (3)

513-541: Enhance error handling in the retokenization function

The retokenization function may not handle all edge cases, such as empty tokens or improper escape sequences.

Consider adding input validation and error handling to manage unexpected inputs, ensuring robustness in token processing.

1423-1430: Optimize data loading based on variable boundaries

In the search_segment_optimized_and_output function, partial columns are loaded using var_begin_ix and var_end_ix.

Consider verifying that only necessary data is loaded, potentially improving performance. Ensure that load_partial_columns correctly handles the specified variable ranges.

1470-1487: Improve handling of disjoint variable ranges in get_union_of_bounds

The current implementation merges variable boundaries by taking the minimum and maximum indices, which may include unrelated variables.

Consider refining the function to handle disjoint ranges more efficiently. This could avoid processing irrelevant variables and enhance performance.

Implementing a more sophisticated range merging strategy may optimize variable handling across subqueries.

components/core/src/glt/Query.hpp (2)

68-74: Encapsulate QueryBoundary member variables by making them private.

To uphold encapsulation principles, consider making var_begin_ix and var_end_ix private. Provide public getter methods if external access is required.

150-155: Complete and clarify the documentation for set_logtype_boundary.

The comment block appears incomplete and may cause confusion. Ensure that the documentation fully describes the method's purpose, parameters, and behaviour.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d524ead and 9c60bd5.

📒 Files selected for processing (8)

components/core/src/glt/Grep.cpp (11 hunks)
components/core/src/glt/Grep.hpp (1 hunks)
components/core/src/glt/LogTypeDictionaryEntry.cpp (1 hunks)
components/core/src/glt/LogTypeDictionaryEntry.hpp (1 hunks)
components/core/src/glt/Query.cpp (2 hunks)
components/core/src/glt/Query.hpp (5 hunks)
components/core/src/glt/glt/search.cpp (1 hunks)
components/core/src/glt/streaming_archive/reader/Archive.cpp (1 hunks)

🧰 Additional context used

🪛 cppcheck

components/core/src/glt/Grep.cpp

[error] 422-422: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

[error] 449-449: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

[error] 509-509: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

[error] 585-585: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

🔇 Additional comments (10)

components/core/src/glt/Query.cpp (2)

178-178: LGTM: Proper cleanup of logtype boundaries

The addition of m_logtype_boundaries.clear() ensures proper cleanup of boundary data when resetting the sub-query state.

222-222: Verify member variable initialization

The method now uses m_var_begin_ix and m_var_end_ix instead of hardcoded values. Please ensure these member variables are properly initialized.

Let's verify the initialization:

components/core/src/glt/Grep.hpp (1)

216-220: Verify error handling for edge cases.

Please clarify how the method handles these edge cases:

Empty sub_queries vector
Non-overlapping bounds
Invalid indices in the input queries

components/core/src/glt/Grep.cpp (4)

609-609: Verify assumptions about tokens with wildcards

The comment on line 609 suggests that if query_token.is_var() returns false, the token is text only with * in it.

Please verify that this assumption holds true in all cases. There may be scenarios where tokens are misidentified, potentially leading to incorrect query processing.

648-651: Check integration of retokenization and boundary calculations

The integration of retokenization and find_boundaries starting at line 650 should be verified to ensure that token boundaries are accurately calculated.

Please ensure that retokenization produces the expected tokens and that find_boundaries correctly computes the variable boundaries based on these tokens.

1086-1091: Ensure accurate construction of LogtypeQuery with boundaries

When constructing LogtypeQuery at lines 1086-1091, ensure that the boundaries obtained are valid and correctly associated with the corresponding logtype_id.

Please verify that get_boundary_by_logtype_id returns correct boundaries and handle any potential exceptions or invalid data.

Line range hint 1345-1358: Pass variable boundaries to find_message_matching_with_logtype_query_from_combined

At lines 1345-1358, the variables var_begin_ix and var_end_ix are computed and passed to find_message_matching_with_logtype_query_from_combined.

Ensure that these boundaries are accurately calculated and that they correctly influence the search within the combined table. Verify that the function handles these parameters appropriately.

components/core/src/glt/Query.hpp (3)

6-6: Include necessary header for std::unordered_map.

The inclusion of <unordered_map> is appropriate since std::unordered_map is utilized later in the code.

247-255: Constructor update for LogtypeQuery with QueryBoundary is appropriate.

Integrating QueryBoundary into the constructor and initializing m_var_begin_ix and m_var_end_ix enhances the management of variable boundaries.

267-270: Addition of getter methods for variable indices enhances encapsulation.

Providing get_begin_ix() and get_end_ix() allows controlled access to m_var_begin_ix and m_var_end_ix, promoting encapsulation.

coderabbitai · 2024-11-13T12:18:40Z

components/core/src/glt/LogTypeDictionaryEntry.hpp

+    /**
+     * Generate a human readable version of value.
+     * @param decompressor
+     */
+    std::string get_human_readable_value() const;


⚠️ Potential issue

Fix incorrect documentation for get_human_readable_value()

The method documentation appears to be incorrect:

It mentions a "decompressor" parameter that doesn't exist

The @param tag is present but the method takes no parameters

Apply this diff to fix the documentation:

/** - * Generate a human readable version of value. - * @param decompressor + * Generates a human-readable version of the log type entry by combining + * constant segments with placeholder indicators. + * @return A string representation of the log type entry */ std::string get_human_readable_value() const;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

/**

* Generate a human readable version of value.

* @param decompressor

*/

std::string get_human_readable_value() const;

/**

* Generates a human-readable version of the log type entry by combining

* constant segments with placeholder indicators.

* @return A string representation of the log type entry

*/

std::string get_human_readable_value() const;

💡 Codebase verification

Implementation of get_human_readable_value does not utilize m_placeholder_positions.

🔗 Analysis chain

Verify implementation matches the header declaration

The method declaration looks correct, but let's verify its implementation matches the intended functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Verify the implementation of get_human_readable_value() # Expected: Find the implementation and verify it uses m_value and placeholder info # Search for the implementation rg -U "std::string\s+LogTypeDictionaryEntry::get_human_readable_value\s+const\s*\{[\s\S]+?\}" components/core/src/glt/ # Check if the implementation uses required member variables rg -U "get_human_readable_value.*\{[\s\S]+?m_(value|placeholder_positions)" components/core/src/glt/

Length of output: 1364

coderabbitai · 2024-11-13T12:18:40Z

components/core/src/glt/Query.cpp

+void SubQuery::set_logtype_boundary(
+        glt::logtype_dictionary_id_t logtype_id,
+        size_t var_begin_ix,
+        size_t var_end_ix
+) {
+    m_logtype_boundaries.emplace(logtype_id, QueryBoundary(var_begin_ix, var_end_ix));


🛠️ Refactor suggestion

Add parameter validation and documentation

The new method needs several improvements:

Add parameter validation to ensure var_begin_ix <= var_end_ix

Add documentation explaining the purpose and parameters

Consider handling duplicate logtype_id entries

Consider this implementation:

+/** + * Sets the variable boundary for a specific logtype + * @param logtype_id The ID of the logtype to set boundaries for + * @param var_begin_ix The starting index of variables (inclusive) + * @param var_end_ix The ending index of variables (exclusive) + * @throws std::invalid_argument if var_begin_ix > var_end_ix + */ void SubQuery::set_logtype_boundary( glt::logtype_dictionary_id_t logtype_id, size_t var_begin_ix, size_t var_end_ix ) { + if (var_begin_ix > var_end_ix) { + throw std::invalid_argument("var_begin_ix must be less than or equal to var_end_ix"); + } + + // Use insert_or_assign to make the overwrite behaviour explicit + m_logtype_boundaries.insert_or_assign( + logtype_id, + QueryBoundary(var_begin_ix, var_end_ix) + ); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

void SubQuery::set_logtype_boundary(

glt::logtype_dictionary_id_t logtype_id,

size_t var_begin_ix,

size_t var_end_ix

) {

m_logtype_boundaries.emplace(logtype_id, QueryBoundary(var_begin_ix, var_end_ix));

/**

* Sets the variable boundary for a specific logtype

* @param logtype_id The ID of the logtype to set boundaries for

* @param var_begin_ix The starting index of variables (inclusive)

* @param var_end_ix The ending index of variables (exclusive)

* @throws std::invalid_argument if var_begin_ix > var_end_ix

*/

void SubQuery::set_logtype_boundary(

glt::logtype_dictionary_id_t logtype_id,

size_t var_begin_ix,

size_t var_end_ix

) {

if (var_begin_ix > var_end_ix) {

throw std::invalid_argument("var_begin_ix must be less than or equal to var_end_ix");

}

// Use insert_or_assign to make the overwrite behaviour explicit

m_logtype_boundaries.insert_or_assign(

logtype_id,

QueryBoundary(var_begin_ix, var_end_ix)

);

}

coderabbitai · 2024-11-13T12:18:40Z

components/core/src/glt/glt/search.cpp

@@ -374,7 +374,7 @@ static size_t search_segments(
        );

        // first search through the single variable table
-        num_matches += Grep::search_segment_and_output(
+        num_matches += Grep::search_segment_optimized_and_output(


💡 Codebase verification

Methods search_segment_and_output and search_segment_optimized_and_output not found

The specified methods could not be located in the codebase. Please verify their implementation.

🔗 Analysis chain

Please provide context for the optimization change

The change from search_segment_and_output to search_segment_optimized_and_output appears to be part of the GLT search optimization effort. However, since this PR is in draft status and the description is TBD:

Could you elaborate on the specific optimizations implemented in the new method?

Have you conducted performance benchmarks to measure the improvement?

Let's verify the implementation of the optimized method:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Search for the implementation and usage of both methods echo "Searching for the original method implementation..." ast-grep --pattern 'search_segment_and_output($$$)' echo "Searching for the optimized method implementation..." ast-grep --pattern 'search_segment_optimized_and_output($$$)' echo "Checking for any performance benchmarks..." fd -e cpp -e hpp -e h benchmark

Length of output: 474

coderabbitai · 2024-11-13T12:18:41Z

components/core/src/glt/streaming_archive/reader/Archive.cpp

+                logtype_table.get_next_row(
+                        vars_to_load,
+                        possible_sub_query.get_begin_ix(),
+                        possible_sub_query.get_end_ix()
+                );


💡 Codebase verification

Missing Index Validation in LogtypeQuery Class

The methods get_begin_ix and get_end_ix in LogtypeQuery do not perform bounds checking, which may lead to invalid indices being used. Please implement appropriate validation to ensure index safety.

🔗 Analysis chain

Optimization looks good - Verify index validation

The optimization to load only the required columns using specific index ranges is a good improvement for performance. This should reduce memory usage and processing time by avoiding loading unnecessary columns.

Let's verify that the indices are properly validated in the LogtypeQuery class:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Check index validation in LogtypeQuery class # Expected: Find index validation in get_begin_ix and get_end_ix methods # Search for index validation in LogtypeQuery class ast-grep --pattern 'class LogtypeQuery { $$$ get_begin_ix() { $$$ } $$$ get_end_ix() { $$$ } $$$ }' # Search for any bounds checking rg -A 5 'get_(begin|end)_ix'

Length of output: 1850

coderabbitai · 2024-11-13T12:18:41Z

components/core/src/glt/Grep.cpp

+                // GLT TODO
                // Actually this is incorrect, because it's possible user enters 23412*34 aiming to
-                // match 23412.34. This should be an ambigious type.
+                // match 23412.34. we should consider the possibility that middle wildcard causes
+                // the converts_to_non_dict_var to be false.
                m_type = Type::DictionaryVar;


⚠️ Potential issue

Address the TODO comment and handle middle wildcard cases

There is a TODO comment indicating that the current logic may not correctly handle patterns like 23412*34 intended to match 23412.34. Middle wildcards might cause converts_to_non_dict_var to be false incorrectly.

Please implement the necessary logic to account for middle wildcards affecting variable conversion. Would you like assistance in refining this logic or proposing a solution?

coderabbitai · 2024-11-13T12:18:41Z

components/core/src/glt/Grep.cpp

+            printf("Unexpected\n");
+            throw;


⚠️ Potential issue

Correct use of throw; without an active exception

In the escape_decoder function, after printing "Unexpected", throw; is used without an active exception at line 585. This will cause std::terminate() to be called.

Replace throw; with an appropriate exception:

- throw; + throw std::runtime_error("Unexpected character encountered in escape_decoder");

Ensure that <stdexcept> is included if not already present.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

printf("Unexpected\n");

throw;

printf("Unexpected\n");

throw std::runtime_error("Unexpected character encountered in escape_decoder");

🧰 Tools

🪛 cppcheck

[error] 585-585: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

coderabbitai · 2024-11-13T12:18:41Z

components/core/src/glt/Grep.cpp

+    size_t find_start_index{0};
+    bool tokens_contain_variable{false};
+    for (auto const& token : tokens) {
+        auto const& token_str = token.first;
+        bool contains_variable = token.second;
+        size_t found_index = logtype_string.find(token_str, find_start_index);
+        if (string::npos == found_index) {
+            printf("failed to find: [%s] from %s\n",
+                   token_str.c_str(),
+                   logtype_string.substr(find_start_index).c_str());
+            throw;
+        }
+        // the first time we see a token with variable, we know that
+        //  we don't care about the variables in the substr before this token in the logtype.
+        //  Technically, logtype_string.substr[0, token[begin_index])
+        //  (since token[begin_index] is the beginning of the token)
+        if (contains_variable) {
+            tokens_contain_variable = true;
+            left_boundary = found_index;
+            break;
+        }
+        // else, the token doesn't contain a variable
+        // we can proceed by skipping this token.
+        find_start_index = found_index + token_str.length();
+    }
+
+    // second, match the token from back
+    size_t rfind_end_index = logtype_string.length();
+    for (auto it = tokens.rbegin(); it != tokens.rend(); ++it) {
+        auto const& token_str = it->first;
+        bool contains_var = it->second;
+
+        size_t rfound_index = logtype_string.rfind(token_str, rfind_end_index);
+        if (string::npos == rfound_index) {
+            printf("failed to find: [%s] from %s\n",
+                   token_str.c_str(),
+                   logtype_string.substr(0, rfind_end_index).c_str());
+            throw;
+        }
+
+        // the first time we see a token with variable, we know that
+        // we don't care about the variables in the substr after this token in the logtype.
+        // Technically, logtype_string.substr[rfound_index + len(token), end)
+        // since logtype_string[rfound_index] is the beginning of the token
+        if (contains_var) {
+            tokens_contain_variable = true;
+            right_boundary = rfound_index + token_str.length();
+            break;
+        }
+
+        // Note, rfind end index is inclusive. has to subtract by 1 so
+        // in the next rfind, we skip the token we have already seen.
+        rfind_end_index = rfound_index - 1;
+    }
+
+    // if we didn't find any variable, we can do an early return
+    if (false == tokens_contain_variable) {
+        var_begin_ix = logtype_entry->get_num_variables();
+        var_end_ix = 0;
+        return;
+    }
+
+    // Now we have the left boundary and right boundary, try to filter out the variables;
+    // var_begin_ix is an inclusive interval
+    auto const logtype_variable_num = logtype_entry->get_num_variables();
+    ir::VariablePlaceholder var_placeholder;
+    var_begin_ix = 0;
+    for (size_t var_ix = 0; var_ix < logtype_variable_num; var_ix++) {
+        size_t var_position = logtype_entry->get_variable_info(var_ix, var_placeholder);
+        if (var_position < left_boundary) {
+            // if the variable is within the left boundary, then it should be skipped.
+            var_begin_ix++;
+        } else {
+            // if the variable is not within the left boundary
+            break;
+        }
+    }
+
+    // For right boundary, var_end_ix is an exclusive interval
+    var_end_ix = logtype_variable_num;
+    for (size_t var_ix = 0; var_ix < logtype_variable_num; var_ix++) {
+        size_t reversed_ix = logtype_variable_num - 1 - var_ix;
+        size_t var_position = logtype_entry->get_variable_info(reversed_ix, var_placeholder);
+        if (var_position >= right_boundary) {
+            // if the variable is within the right boundary, then it should be skipped.
+            var_end_ix--;
+        } else {
+            // if the variable is not within the right boundary
+            break;
+        }
+    }
+
+    if (var_end_ix <= var_begin_ix) {
+        printf("tokens contain a variable, end index %lu is smaller and equal than begin index "
+               "%lu\n",
+               var_end_ix,
+               var_begin_ix);
+        throw;
+    }
+}


⚠️ Potential issue

Fix improper use of throw; without an active exception

In the find_boundaries function, throw; is used without an active exception context at lines 422, 449, and 509. In C++, using throw; without a current exception will result in std::terminate() being called.

Apply the following diff to throw an appropriate exception:

- throw; + throw std::runtime_error("Failed to find token in logtype_string");

Ensure that <stdexcept> is included:

+#include <stdexcept>

Repeat this change for all instances where throw; is used without an active exception.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 cppcheck

[error] 422-422: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

[error] 449-449: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

[error] 509-509: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

coderabbitai · 2024-11-13T12:18:41Z

components/core/src/glt/Query.hpp

+    QueryBoundary const& get_boundary_by_logtype_id(logtype_dictionary_id_t logtype_id) const {
+        return m_logtype_boundaries.at(logtype_id);
+    }


⚠️ Potential issue

Handle missing keys in get_boundary_by_logtype_id to prevent exceptions.

Using m_logtype_boundaries.at(logtype_id) throws a std::out_of_range exception if logtype_id is not present. To avoid potential runtime exceptions, consider checking for the key's existence before accessing it.

Apply this diff to handle missing keys gracefully:

QueryBoundary const& get_boundary_by_logtype_id(logtype_dictionary_id_t logtype_id) const { - return m_logtype_boundaries.at(logtype_id); + auto it = m_logtype_boundaries.find(logtype_id); + if (it != m_logtype_boundaries.end()) { + return it->second; + } else { + // Handle the absence of logtype_id appropriately + // Possible options: throw a custom exception, return a default value, or assert + throw std::invalid_argument("logtype_id not found in m_logtype_boundaries"); + } }

Committable suggestion skipped: line range outside the PR's diff.

haiqi96 added 30 commits January 15, 2024 19:05

Completely duplicate CLP to prepare for GLT

96f18d5

rename namespace in the duplicated codebase

19dbadb

Rough support compression

fd94018

Fix bugs in compression

1196327

Rough support decompression

9718d56

Fix size calculation

1cf9bac

Preliminary support for non-optimized search

693ad94

Preliminary support for optimized search

979b029

index magic to handle the fact var_position gets updated to placeholder

a6f2025

Fix GLT specific timestamp issue

2b8c883

Add get variable info for now.

6becc48

Run linter

7366ed5

Fix variable placeholder

a44ecad

Update argument interface

8f41624

Some clean and linter

46725f4

Remove logsurgeon and unused libs

5b76807

rearrange class variables methods

8ad0793

Mark TODOs

50b79ba

Compress file dict

0617c48

linter fix

11fd9b7

Remove gltg and move search into glt binary

c63cccb

Fix output method code and hide output method option from user.

61b3eb8

Remove prematured optimization

66275da

Update readme

6702c9d

Add comments and tokenization code

ff5b61f

commit find boundary function

93808f0

support optimization. except that escape is not well supported yet

02b4a30

Small fix and utilities

87880f8

Fix include and indexing boundary case for find left boundary

1e69b99

Run linter

e9fde16

haiqi96 added 5 commits January 23, 2024 01:42

Handle a corner case where none of the token contains variable.

f12aa15

Support escape properly

7db1315

Remove unused string utils

d698c01

Merge branch 'main' into glt_optimization

5c033f4

refactor comments to make the PR less confusing

9c60bd5

kirkrodrigues added the on hold On hold temporarily label Nov 13, 2024

coderabbitai bot reviewed Nov 13, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft: Optimization for GLT search #252

Draft: Optimization for GLT search #252

haiqi96 commented Jan 26, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 13, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

kirkrodrigues commented Nov 13, 2024

coderabbitai bot left a comment

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

coderabbitai bot Nov 13, 2024

Draft: Optimization for GLT search #252

Are you sure you want to change the base?

Draft: Optimization for GLT search #252

Conversation

haiqi96 commented Jan 26, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

Release Notes

coderabbitai bot commented Nov 13, 2024 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

kirkrodrigues commented Nov 13, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 13, 2024

Choose a reason for hiding this comment

haiqi96 commented Jan 26, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 13, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)