Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Optimization for GLT search #252

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

haiqi96
Copy link
Contributor

@haiqi96 haiqi96 commented Jan 26, 2024

TBD

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced query processing with improved handling of variable tokens and boundaries.
    • Introduced a method to generate human-readable representations of log type entries.
    • Added a new class for managing variable boundaries within queries.
  • Bug Fixes

    • Refined error handling for search operations and file reading, improving robustness.
  • Improvements

    • Optimized search functionality for better performance and reliability.
    • Updated methods to streamline variable matching and log type boundary management.

@kirkrodrigues kirkrodrigues added the on hold On hold temporarily label Nov 13, 2024
Copy link
Contributor

coderabbitai bot commented Nov 13, 2024

Walkthrough

The changes in this pull request involve multiple enhancements to the Grep, LogTypeDictionaryEntry, and Query classes, focusing on query token processing, boundary calculations, and log type handling. New methods for finding boundaries and retokenization are introduced, along with a new QueryBoundary class to manage variable boundaries. The search functionality is optimized, and error handling is improved across various components. These modifications collectively enhance the robustness and efficiency of the query processing system.

Changes

File Path Change Summary
components/core/src/glt/Grep.cpp Added methods: find_boundaries, retokenization, get_union_of_bounds. Updated generate_logtypes_and_vars_for_subquery logic.
components/core/src/glt/Grep.hpp Added static method: get_union_of_bounds.
components/core/src/glt/LogTypeDictionaryEntry.cpp Added method: get_human_readable_value for constructing a human-readable log type entry.
components/core/src/glt/LogTypeDictionaryEntry.hpp Added public method: get_human_readable_value.
components/core/src/glt/Query.cpp Added method: set_logtype_boundary in SubQuery. Updated clear and matches_vars methods for handling logtype boundaries.
components/core/src/glt/Query.hpp Introduced new class: QueryBoundary. Added methods: get_boundary_by_logtype_id, set_logtype_boundary. Updated LogtypeQuery constructor.
components/core/src/glt/glt/search.cpp Updated search_segments to use Grep::search_segment_optimized_and_output. Enhanced error handling in open_archive and search functions.
components/core/src/glt/streaming_archive/reader/Archive.cpp Modified find_message_matching_with_logtype_query_optimized to optimize row retrieval. Enhanced error handling in load_filename_dict.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Query
    participant Grep
    participant LogTypeDictionaryEntry
    participant Archive

    User->>Query: Initiate search
    Query->>Grep: Generate log types and variables
    Grep->>Grep: Find boundaries
    Grep->>Grep: Retokenization
    Grep->>LogTypeDictionaryEntry: Get human-readable value
    Query->>Archive: Load data
    Archive->>Query: Return data
    Query->>User: Display results
Loading

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@kirkrodrigues
Copy link
Member

Labelled on hold until we have resources to revisit this. Possible actions items are:

  • Describe the change (logic + testing + benchmark results)
  • Add any more changes based on new insights

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (8)
components/core/src/glt/Grep.hpp (2)

216-220: Add documentation for the new method.

The new method lacks essential documentation. Please add a documentation block that explains:

  • The purpose of the method
  • The expected format and constraints of the input sub_queries
  • The meaning and valid ranges of var_begin_ix and var_end_ix
  • Any error conditions or edge cases that callers should be aware of

Consider this format:

/**
 * Computes the union of variable bounds across multiple sub-queries
 * @param sub_queries Vector of LogtypeQuery objects to process
 * @param var_begin_ix [out] The resulting start index of the union
 * @param var_end_ix [out] The resulting end index of the union
 * @throw std::invalid_argument if sub_queries is empty [if applicable]
 */

216-220: Consider a more specific method name.

The current name get_union_of_bounds is somewhat vague. Consider a more specific name that indicates what type of bounds are being unified, such as get_union_of_variable_bounds or merge_query_variable_ranges.

components/core/src/glt/LogTypeDictionaryEntry.cpp (1)

206-233: LGTM! Consider enhancing documentation and readability.

The implementation is logically sound and correctly handles both placeholders and constants. However, a few improvements could make it more maintainable:

  1. Consider adding documentation to explain the format of the human-readable output
  2. The single-character representations ('v', 'f', 'i') for placeholder types could be made clearer using constants or an enum
  3. Consider handling escape sequences in the output for better readability

Here's a suggested improvement:

+    // Constants for placeholder type representations
+    static const char DICT_PLACEHOLDER = 'v';
+    static const char FLOAT_PLACEHOLDER = 'f';
+    static const char INT_PLACEHOLDER = 'i';
+
     string human_readable_value;
 
     size_t constant_begin_pos = 0;
     for (size_t placeholder_ix = 0; placeholder_ix < get_num_placeholders(); ++placeholder_ix) {
         VariablePlaceholder placeholder;
         size_t placeholder_pos = get_placeholder_info(placeholder_ix, placeholder);
 
         // Add the constant that's between the last variable and this one, with newlines escaped
         human_readable_value
                 .append(m_value, constant_begin_pos, placeholder_pos - constant_begin_pos);
 
         if (VariablePlaceholder::Dictionary == placeholder) {
-            human_readable_value += "v";
+            human_readable_value += DICT_PLACEHOLDER;
         } else if (VariablePlaceholder::Float == placeholder) {
-            human_readable_value += "f";
+            human_readable_value += FLOAT_PLACEHOLDER;
         } else if (VariablePlaceholder::Integer == placeholder) {
-            human_readable_value += "i";
+            human_readable_value += INT_PLACEHOLDER;
         }

Also, consider adding documentation in the header file:

/**
 * @brief Generates a human-readable representation of the log type entry
 * @return A string where placeholders are represented as:
 *         - 'v' for Dictionary variables
 *         - 'f' for Float variables
 *         - 'i' for Integer variables
 *         Constants between placeholders are preserved as-is
 */
string get_human_readable_value() const;
components/core/src/glt/Grep.cpp (3)

513-541: Enhance error handling in the retokenization function

The retokenization function may not handle all edge cases, such as empty tokens or improper escape sequences.

Consider adding input validation and error handling to manage unexpected inputs, ensuring robustness in token processing.


1423-1430: Optimize data loading based on variable boundaries

In the search_segment_optimized_and_output function, partial columns are loaded using var_begin_ix and var_end_ix.

Consider verifying that only necessary data is loaded, potentially improving performance. Ensure that load_partial_columns correctly handles the specified variable ranges.


1470-1487: Improve handling of disjoint variable ranges in get_union_of_bounds

The current implementation merges variable boundaries by taking the minimum and maximum indices, which may include unrelated variables.

Consider refining the function to handle disjoint ranges more efficiently. This could avoid processing irrelevant variables and enhance performance.

Implementing a more sophisticated range merging strategy may optimize variable handling across subqueries.

components/core/src/glt/Query.hpp (2)

68-74: Encapsulate QueryBoundary member variables by making them private.

To uphold encapsulation principles, consider making var_begin_ix and var_end_ix private. Provide public getter methods if external access is required.


150-155: Complete and clarify the documentation for set_logtype_boundary.

The comment block appears incomplete and may cause confusion. Ensure that the documentation fully describes the method's purpose, parameters, and behaviour.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d524ead and 9c60bd5.

📒 Files selected for processing (8)
  • components/core/src/glt/Grep.cpp (11 hunks)
  • components/core/src/glt/Grep.hpp (1 hunks)
  • components/core/src/glt/LogTypeDictionaryEntry.cpp (1 hunks)
  • components/core/src/glt/LogTypeDictionaryEntry.hpp (1 hunks)
  • components/core/src/glt/Query.cpp (2 hunks)
  • components/core/src/glt/Query.hpp (5 hunks)
  • components/core/src/glt/glt/search.cpp (1 hunks)
  • components/core/src/glt/streaming_archive/reader/Archive.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/glt/Grep.cpp

[error] 422-422: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)


[error] 449-449: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)


[error] 509-509: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)


[error] 585-585: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

🔇 Additional comments (10)
components/core/src/glt/Query.cpp (2)

178-178: LGTM: Proper cleanup of logtype boundaries

The addition of m_logtype_boundaries.clear() ensures proper cleanup of boundary data when resetting the sub-query state.


222-222: Verify member variable initialization

The method now uses m_var_begin_ix and m_var_end_ix instead of hardcoded values. Please ensure these member variables are properly initialized.

Let's verify the initialization:

components/core/src/glt/Grep.hpp (1)

216-220: Verify error handling for edge cases.

Please clarify how the method handles these edge cases:

  • Empty sub_queries vector
  • Non-overlapping bounds
  • Invalid indices in the input queries
components/core/src/glt/Grep.cpp (4)

609-609: Verify assumptions about tokens with wildcards

The comment on line 609 suggests that if query_token.is_var() returns false, the token is text only with * in it.

Please verify that this assumption holds true in all cases. There may be scenarios where tokens are misidentified, potentially leading to incorrect query processing.


648-651: Check integration of retokenization and boundary calculations

The integration of retokenization and find_boundaries starting at line 650 should be verified to ensure that token boundaries are accurately calculated.

Please ensure that retokenization produces the expected tokens and that find_boundaries correctly computes the variable boundaries based on these tokens.


1086-1091: Ensure accurate construction of LogtypeQuery with boundaries

When constructing LogtypeQuery at lines 1086-1091, ensure that the boundaries obtained are valid and correctly associated with the corresponding logtype_id.

Please verify that get_boundary_by_logtype_id returns correct boundaries and handle any potential exceptions or invalid data.


Line range hint 1345-1358: Pass variable boundaries to find_message_matching_with_logtype_query_from_combined

At lines 1345-1358, the variables var_begin_ix and var_end_ix are computed and passed to find_message_matching_with_logtype_query_from_combined.

Ensure that these boundaries are accurately calculated and that they correctly influence the search within the combined table. Verify that the function handles these parameters appropriately.

components/core/src/glt/Query.hpp (3)

6-6: Include necessary header for std::unordered_map.

The inclusion of <unordered_map> is appropriate since std::unordered_map is utilized later in the code.


247-255: Constructor update for LogtypeQuery with QueryBoundary is appropriate.

Integrating QueryBoundary into the constructor and initializing m_var_begin_ix and m_var_end_ix enhances the management of variable boundaries.


267-270: Addition of getter methods for variable indices enhances encapsulation.

Providing get_begin_ix() and get_end_ix() allows controlled access to m_var_begin_ix and m_var_end_ix, promoting encapsulation.

Comment on lines +182 to +186
/**
* Generate a human readable version of value.
* @param decompressor
*/
std::string get_human_readable_value() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incorrect documentation for get_human_readable_value()

The method documentation appears to be incorrect:

  • It mentions a "decompressor" parameter that doesn't exist
  • The @param tag is present but the method takes no parameters

Apply this diff to fix the documentation:

    /**
-     * Generate a human readable version of value.
-     * @param decompressor
+     * Generates a human-readable version of the log type entry by combining
+     * constant segments with placeholder indicators.
+     * @return A string representation of the log type entry
     */
    std::string get_human_readable_value() const;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Generate a human readable version of value.
* @param decompressor
*/
std::string get_human_readable_value() const;
/**
* Generates a human-readable version of the log type entry by combining
* constant segments with placeholder indicators.
* @return A string representation of the log type entry
*/
std::string get_human_readable_value() const;

💡 Codebase verification

Implementation of get_human_readable_value does not utilize m_placeholder_positions.

🔗 Analysis chain

Verify implementation matches the header declaration

The method declaration looks correct, but let's verify its implementation matches the intended functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of get_human_readable_value()
# Expected: Find the implementation and verify it uses m_value and placeholder info

# Search for the implementation
rg -U "std::string\s+LogTypeDictionaryEntry::get_human_readable_value\(\)\s+const\s*\{[\s\S]+?\}" components/core/src/glt/

# Check if the implementation uses required member variables
rg -U "get_human_readable_value.*\{[\s\S]+?m_(value|placeholder_positions)" components/core/src/glt/

Length of output: 1364

Comment on lines +182 to +187
void SubQuery::set_logtype_boundary(
glt::logtype_dictionary_id_t logtype_id,
size_t var_begin_ix,
size_t var_end_ix
) {
m_logtype_boundaries.emplace(logtype_id, QueryBoundary(var_begin_ix, var_end_ix));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add parameter validation and documentation

The new method needs several improvements:

  1. Add parameter validation to ensure var_begin_ix <= var_end_ix
  2. Add documentation explaining the purpose and parameters
  3. Consider handling duplicate logtype_id entries

Consider this implementation:

+/**
+ * Sets the variable boundary for a specific logtype
+ * @param logtype_id The ID of the logtype to set boundaries for
+ * @param var_begin_ix The starting index of variables (inclusive)
+ * @param var_end_ix The ending index of variables (exclusive)
+ * @throws std::invalid_argument if var_begin_ix > var_end_ix
+ */
 void SubQuery::set_logtype_boundary(
         glt::logtype_dictionary_id_t logtype_id,
         size_t var_begin_ix,
         size_t var_end_ix
 ) {
+    if (var_begin_ix > var_end_ix) {
+        throw std::invalid_argument("var_begin_ix must be less than or equal to var_end_ix");
+    }
+    
+    // Use insert_or_assign to make the overwrite behaviour explicit
+    m_logtype_boundaries.insert_or_assign(
+        logtype_id,
+        QueryBoundary(var_begin_ix, var_end_ix)
+    );
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void SubQuery::set_logtype_boundary(
glt::logtype_dictionary_id_t logtype_id,
size_t var_begin_ix,
size_t var_end_ix
) {
m_logtype_boundaries.emplace(logtype_id, QueryBoundary(var_begin_ix, var_end_ix));
/**
* Sets the variable boundary for a specific logtype
* @param logtype_id The ID of the logtype to set boundaries for
* @param var_begin_ix The starting index of variables (inclusive)
* @param var_end_ix The ending index of variables (exclusive)
* @throws std::invalid_argument if var_begin_ix > var_end_ix
*/
void SubQuery::set_logtype_boundary(
glt::logtype_dictionary_id_t logtype_id,
size_t var_begin_ix,
size_t var_end_ix
) {
if (var_begin_ix > var_end_ix) {
throw std::invalid_argument("var_begin_ix must be less than or equal to var_end_ix");
}
// Use insert_or_assign to make the overwrite behaviour explicit
m_logtype_boundaries.insert_or_assign(
logtype_id,
QueryBoundary(var_begin_ix, var_end_ix)
);
}

@@ -374,7 +374,7 @@ static size_t search_segments(
);

// first search through the single variable table
num_matches += Grep::search_segment_and_output(
num_matches += Grep::search_segment_optimized_and_output(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Methods search_segment_and_output and search_segment_optimized_and_output not found

The specified methods could not be located in the codebase. Please verify their implementation.

🔗 Analysis chain

Please provide context for the optimization change

The change from search_segment_and_output to search_segment_optimized_and_output appears to be part of the GLT search optimization effort. However, since this PR is in draft status and the description is TBD:

  1. Could you elaborate on the specific optimizations implemented in the new method?
  2. Have you conducted performance benchmarks to measure the improvement?

Let's verify the implementation of the optimized method:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the implementation and usage of both methods
echo "Searching for the original method implementation..."
ast-grep --pattern 'search_segment_and_output($$$)' 

echo "Searching for the optimized method implementation..."
ast-grep --pattern 'search_segment_optimized_and_output($$$)'

echo "Checking for any performance benchmarks..."
fd -e cpp -e hpp -e h benchmark

Length of output: 474

Comment on lines +410 to +414
logtype_table.get_next_row(
vars_to_load,
possible_sub_query.get_begin_ix(),
possible_sub_query.get_end_ix()
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Missing Index Validation in LogtypeQuery Class

The methods get_begin_ix and get_end_ix in LogtypeQuery do not perform bounds checking, which may lead to invalid indices being used. Please implement appropriate validation to ensure index safety.

🔗 Analysis chain

Optimization looks good - Verify index validation

The optimization to load only the required columns using specific index ranges is a good improvement for performance. This should reduce memory usage and processing time by avoiding loading unnecessary columns.

Let's verify that the indices are properly validated in the LogtypeQuery class:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check index validation in LogtypeQuery class
# Expected: Find index validation in get_begin_ix and get_end_ix methods

# Search for index validation in LogtypeQuery class
ast-grep --pattern 'class LogtypeQuery {
  $$$
  get_begin_ix() {
    $$$
  }
  $$$
  get_end_ix() {
    $$$
  }
  $$$
}'

# Search for any bounds checking
rg -A 5 'get_(begin|end)_ix'

Length of output: 1850

Comment on lines +164 to 168
// GLT TODO
// Actually this is incorrect, because it's possible user enters 23412*34 aiming to
// match 23412.34. This should be an ambigious type.
// match 23412.34. we should consider the possibility that middle wildcard causes
// the converts_to_non_dict_var to be false.
m_type = Type::DictionaryVar;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Address the TODO comment and handle middle wildcard cases

There is a TODO comment indicating that the current logic may not correctly handle patterns like 23412*34 intended to match 23412.34. Middle wildcards might cause converts_to_non_dict_var to be false incorrectly.

Please implement the necessary logic to account for middle wildcards affecting variable conversion. Would you like assistance in refining this logic or proposing a solution?

Comment on lines +584 to +585
printf("Unexpected\n");
throw;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Correct use of throw; without an active exception

In the escape_decoder function, after printing "Unexpected", throw; is used without an active exception at line 585. This will cause std::terminate() to be called.

Replace throw; with an appropriate exception:

-        throw;
+        throw std::runtime_error("Unexpected character encountered in escape_decoder");

Ensure that <stdexcept> is included if not already present.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
printf("Unexpected\n");
throw;
printf("Unexpected\n");
throw std::runtime_error("Unexpected character encountered in escape_decoder");
🧰 Tools
🪛 cppcheck

[error] 585-585: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

Comment on lines +397 to +511
size_t find_start_index{0};
bool tokens_contain_variable{false};
for (auto const& token : tokens) {
auto const& token_str = token.first;
bool contains_variable = token.second;
size_t found_index = logtype_string.find(token_str, find_start_index);
if (string::npos == found_index) {
printf("failed to find: [%s] from %s\n",
token_str.c_str(),
logtype_string.substr(find_start_index).c_str());
throw;
}
// the first time we see a token with variable, we know that
// we don't care about the variables in the substr before this token in the logtype.
// Technically, logtype_string.substr[0, token[begin_index])
// (since token[begin_index] is the beginning of the token)
if (contains_variable) {
tokens_contain_variable = true;
left_boundary = found_index;
break;
}
// else, the token doesn't contain a variable
// we can proceed by skipping this token.
find_start_index = found_index + token_str.length();
}

// second, match the token from back
size_t rfind_end_index = logtype_string.length();
for (auto it = tokens.rbegin(); it != tokens.rend(); ++it) {
auto const& token_str = it->first;
bool contains_var = it->second;

size_t rfound_index = logtype_string.rfind(token_str, rfind_end_index);
if (string::npos == rfound_index) {
printf("failed to find: [%s] from %s\n",
token_str.c_str(),
logtype_string.substr(0, rfind_end_index).c_str());
throw;
}

// the first time we see a token with variable, we know that
// we don't care about the variables in the substr after this token in the logtype.
// Technically, logtype_string.substr[rfound_index + len(token), end)
// since logtype_string[rfound_index] is the beginning of the token
if (contains_var) {
tokens_contain_variable = true;
right_boundary = rfound_index + token_str.length();
break;
}

// Note, rfind end index is inclusive. has to subtract by 1 so
// in the next rfind, we skip the token we have already seen.
rfind_end_index = rfound_index - 1;
}

// if we didn't find any variable, we can do an early return
if (false == tokens_contain_variable) {
var_begin_ix = logtype_entry->get_num_variables();
var_end_ix = 0;
return;
}

// Now we have the left boundary and right boundary, try to filter out the variables;
// var_begin_ix is an inclusive interval
auto const logtype_variable_num = logtype_entry->get_num_variables();
ir::VariablePlaceholder var_placeholder;
var_begin_ix = 0;
for (size_t var_ix = 0; var_ix < logtype_variable_num; var_ix++) {
size_t var_position = logtype_entry->get_variable_info(var_ix, var_placeholder);
if (var_position < left_boundary) {
// if the variable is within the left boundary, then it should be skipped.
var_begin_ix++;
} else {
// if the variable is not within the left boundary
break;
}
}

// For right boundary, var_end_ix is an exclusive interval
var_end_ix = logtype_variable_num;
for (size_t var_ix = 0; var_ix < logtype_variable_num; var_ix++) {
size_t reversed_ix = logtype_variable_num - 1 - var_ix;
size_t var_position = logtype_entry->get_variable_info(reversed_ix, var_placeholder);
if (var_position >= right_boundary) {
// if the variable is within the right boundary, then it should be skipped.
var_end_ix--;
} else {
// if the variable is not within the right boundary
break;
}
}

if (var_end_ix <= var_begin_ix) {
printf("tokens contain a variable, end index %lu is smaller and equal than begin index "
"%lu\n",
var_end_ix,
var_begin_ix);
throw;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix improper use of throw; without an active exception

In the find_boundaries function, throw; is used without an active exception context at lines 422, 449, and 509. In C++, using throw; without a current exception will result in std::terminate() being called.

Apply the following diff to throw an appropriate exception:

-            throw;
+            throw std::runtime_error("Failed to find token in logtype_string");

Ensure that <stdexcept> is included:

+#include <stdexcept>

Repeat this change for all instances where throw; is used without an active exception.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 cppcheck

[error] 422-422: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)


[error] 449-449: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)


[error] 509-509: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

Comment on lines +145 to +147
QueryBoundary const& get_boundary_by_logtype_id(logtype_dictionary_id_t logtype_id) const {
return m_logtype_boundaries.at(logtype_id);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle missing keys in get_boundary_by_logtype_id to prevent exceptions.

Using m_logtype_boundaries.at(logtype_id) throws a std::out_of_range exception if logtype_id is not present. To avoid potential runtime exceptions, consider checking for the key's existence before accessing it.

Apply this diff to handle missing keys gracefully:

     QueryBoundary const& get_boundary_by_logtype_id(logtype_dictionary_id_t logtype_id) const {
-        return m_logtype_boundaries.at(logtype_id);
+        auto it = m_logtype_boundaries.find(logtype_id);
+        if (it != m_logtype_boundaries.end()) {
+            return it->second;
+        } else {
+            // Handle the absence of logtype_id appropriately
+            // Possible options: throw a custom exception, return a default value, or assert
+            throw std::invalid_argument("logtype_id not found in m_logtype_boundaries");
+        }
     }

Committable suggestion skipped: line range outside the PR's diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
on hold On hold temporarily
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants