Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core-clp: Rewrite wildcard matching method and add systematic unit tests (fixes #427). #428

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

kirkrodrigues
Copy link
Member

Description

This PR:

  • rewrites clp::string_utils::wildcard_match_unsafe_case_sensitive both to fix clp::string_utils::wildcard_match_unsafe_case_sensitive fails to match with certain queries #427, to slightly simplify the logic, and to add more comments to explain the algorithm;
  • completely rewrites the unit tests for wildcard matching so that they (hopefully) systematically test all cases;
  • replaces the wildcard performance test to be more realistic by matching against several lines (rather than a single line) from an example log file.

Validation performed

Validated unit tests passed.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed implementation. Overall I think it is now much easier to understand compared to the previous implementation.

components/core/src/clp/string_utils/string_utils.cpp Outdated Show resolved Hide resolved
Comment on lines +237 to +243
// Handle `tame` or `wild` being empty
if (wild.empty()) {
return tame.empty();
}
if (tame.empty()) {
return "*" == wild;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not moving the empty check to the beginning of the function?


// Handle boundary conditions
if (tame_end_it == tame_it) {
return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is repeated in the next for loop. How about we make a helper, sth like is_wild_reaching_end_or_trailing_star? (maybe we can discuss to come up with a better name)

// Handle boundary conditions
if (tame_end_it == tame_it) {
return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);
} else if (wild_end_it == wild_it) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think using a new if instead of else if would be more clear? Essentially these are two different cases to handle.

Comment on lines 324 to 325
} else if (wild_end_it == wild_it) {
if (tame_end_it == tame_it) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if tame_end_it == tame_it and wild_end_it == wild_it, we should already return in the above if right?

Comment on lines +329 to +335
// Reset to bookmarks
tame_it = tame_bookmark_it + 1;
wild_it = wild_bookmark_it;
if (false
== advance_tame_to_next_match(tame_end_it, tame_it, tame_bookmark_it, wild_it))
{
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong: if this branch is triggered, wild has already reached the end without consuming the entire tame. We should be handling the last group of tame after the last *. In this case, we only need to match the last n characters (determined by wild_it - wild_bookmark_it, and properly counting escape chars in between) in tame right? For example, if tame is "aaaaaaaa" and wild is "*a", we don't have to advance tame to match every single "a" but jump to match the last one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Are you suggesting, in this case, that we should iterate backwards from the end of tame to see if it matches the last group in wild?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, iterating backwards is non-trivial because of escaped characters. If we see a ? in wild, we have to check the character before it to know if it's an escape character. But even if it is, we don't know if it's escaping the ? or it's preceded by an escape itself. So it's easier to always iterate forwards.

* @param tame_it Returns `tame`'s updated iterator.
* @param tame_bookmark_it Returns `tame`'s updated bookmark.
* @param wild_it Returns `wild`'s updated iterator.
* @return true on success, false if `tame` cannot match `wild`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @return true on success, false if `tame` cannot match `wild`.
* @return Whether `tame` can successfully match `wild`,

*
* NOTE:
* - This method expects that `tame_it` < `tame_end_it`
* - This method should be inlined for performance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is enforced. afaik we are just suggesting the compiler to inline the method. But I'm also not sure whether we should use always_intline attribute since we may need to compile this file using none-gnu compilers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't enforced, but at the same time I don't know if forcing it to be inlined is necessary. The reason I said it should be inlined is because in past performance tests, the inline hint did make a difference. Nowadays though, it seems like gcc inlines it regardless of the hint.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two high level questions:

  1. The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost
  2. Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?

// possibilities---to generate this variety. For each wildcard string, we also generate one or
// more strings that can be matched by the wildcard string.

// The template below is meant to test 1-3 groups of WildcardStringAlphabets separated by '*'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the generated wild, * is added only at beginning three times (since i is set to iterate 3 times). Did we miss to add * in between when creating template_wildcard_str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

return;
}

size_t const tame_size_before_modification = tame.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t const tame_size_before_modification = tame.size();
auto const tame_size_before_modification = tame.size();


size_t const tame_size_before_modification = tame.size();
auto alphabet = chosen_alphabets.front();
auto next_chosen_alphabets = chosen_alphabets.subspan(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto next_chosen_alphabets = chosen_alphabets.subspan(1);
auto const next_chosen_alphabets = chosen_alphabets.subspan(1);

return;
}

size_t const wild_size_before_modification = wild.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t const wild_size_before_modification = wild.size();
autoconst wild_size_before_modification = wild.size();

Comment on lines 210 to 211
string tame1{"a?c"};
string wild1{"*a\\??*"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use constexpr std::string_view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were committed accidentally. I removed them now.

REQUIRE(wildcard_match_unsafe_case_sensitive(tameString, wildString) == true);
}
TEST_CASE("wildcard_match_unsafe", "[wildcard]") {
string tame{"0!2#4%6&8(aBcDeFgHiJkLmNoPqRsTuVwXyZ"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use constexpr std::string_view?

REQUIRE(wildcard_match_unsafe(tameString, wildString, isCaseSensitive) == true);
SCENARIO("wildcard_match_unsafe_case_sensitive performance", "[wildcard performance]") {
auto const tests_dir = std::filesystem::path{__FILE__}.parent_path();
auto log_file_path = tests_dir / "test_network_reader_src" / "random.log";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto log_file_path = tests_dir / "test_network_reader_src" / "random.log";
auto const log_file_path = tests_dir / "test_network_reader_src" / "random.log";

GIVEN("All lower case tame and mixed lower and upper case wild") {
tameString = "abcde", wildString = "A?c*";
REQUIRE(wildcard_match_unsafe(tameString, wildString, isCaseSensitive) == true);
SCENARIO("wildcard_match_unsafe_case_sensitive performance", "[wildcard performance]") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By reading the doc, we're not using any of GIVEN, THEN, or WHEN. Can we just make it a normal TEST_CASE?

}
}
}
auto end_timestamp = high_resolution_clock::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto end_timestamp = high_resolution_clock::now();
auto const end_timestamp = high_resolution_clock::now();

bool isCaseSensitive = false;
allPassed &= wildcard_match_unsafe("mississippi", "*issip*PI", isCaseSensitive);
REQUIRE(allPassed == true);
auto begin_timestamp = high_resolution_clock::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto begin_timestamp = high_resolution_clock::now();
auto const begin_timestamp = high_resolution_clock::now();

@kirkrodrigues
Copy link
Member Author

  1. The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost

True. I can move the generation code to a Python script and pregenerate them. With the generation bug fix, it takes something like 15 minutes to finish, so pregeneration is definitely worthwhile.

  1. Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?

The previous performance test didn't detect regressions since it was comparing against itself. I'm a little reluctant to set a hard performance target since that's machine specific. For now, I hope everyone who modifies (and reviews) this code will benchmark the implementation before and after to ensure they aren't introducing regressions.

That said, I do think there's a way to speed up and simplify the implementation, but that's for another PR.

@LinZhihao-723
Copy link
Member

  1. The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost

True. I can move the generation code to a Python script and pregenerate them. With the generation bug fix, it takes something like 15 minutes to finish, so pregeneration is definitely worthwhile.

  1. Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?

The previous performance test didn't detect regressions since it was comparing against itself. I'm a little reluctant to set a hard performance target since that's machine specific. For now, I hope everyone who modifies (and reviews) this code will benchmark the implementation before and after to ensure they aren't introducing regressions.

That said, I do think there's a way to speed up and simplify the implementation, but that's for another PR.

For 1: sure. I'm ok to delay it to a new PR.
For 2: I think it's worth adding a comment explaining our expectations about benchmarking the before/after performance

@kirkrodrigues
Copy link
Member Author

For 1: sure. I'm ok to delay it to a new PR.

All the cases end up being a 41GiB file, so I decided to just simplify the test case. I think it still covers everything and it's performant enough imo.

For 2: I think it's worth adding a comment explaining our expectations about benchmarking the before/after performance

Done.

@kirkrodrigues kirkrodrigues marked this pull request as draft November 13, 2024 12:18
@kirkrodrigues kirkrodrigues added the on hold On hold temporarily label Nov 13, 2024
Copy link
Contributor

coderabbitai bot commented Nov 13, 2024

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@kirkrodrigues
Copy link
Member Author

On hold while we do more benchmarking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
on hold On hold temporarily
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clp::string_utils::wildcard_match_unsafe_case_sensitive fails to match with certain queries
2 participants