Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with --pattern-from #32

Open
Zer0-Tolerance opened this issue Nov 24, 2022 · 16 comments
Open

Performance issue with --pattern-from #32

Zer0-Tolerance opened this issue Nov 24, 2022 · 16 comments

Comments

@Zer0-Tolerance
Copy link

Hi Liz,

Another quick fix , when you're using the --pattern-from with regex you're doing a loop with each one of the pattern rather than compiling a single regex with all pattern with | this creates a huge performance penalty because you checking the same file over and over based on the number of line in the input file.
Based on my test on a large CSV 93 MB it takes 294 secs against a 99 regex pattern file vs 84 sec if you convert this as a regex with |.
Can you fix this ?

@lizmat
Copy link
Owner

lizmat commented Nov 24, 2022

Just not quite a quick fix, I'm afraid.

OOC, do any of those patterns consist of just looking for a literal string?

@lizmat
Copy link
Owner

lizmat commented Nov 24, 2022

A first stab at merging consecutive regexes into a single regex for multiple patterns

@lizmat
Copy link
Owner

lizmat commented Nov 24, 2022

0.2.14 now in the ecosystem. Please let me know how this affected the performance!

@Zer0-Tolerance
Copy link
Author

Great I'll test again and let you know ? All patterns are regex.

@Zer0-Tolerance
Copy link
Author

Zer0-Tolerance commented Nov 25, 2022

ok with the latest version it's better than before 129 sec now (as opposed to 284 sec) for the list of pattern vs 81 sec with the single regex with |.
This is the cmd I'm using:
time rak --ignore-case --patterns-from=/tmp/file.txt /path/file.csv

@lizmat
Copy link
Owner

lizmat commented Nov 25, 2022

ok, that's better. By any chance, are one or more of the regexes literal? Consisting of just / foo / ? Because those are currently converted internally to *.contains("foo"), and thus create multiple regexes, instead of just a single one. Which would explain the remaining difference

@Zer0-Tolerance
Copy link
Author

Sorry I forgot to add the type to the cmd so no pattern is interpreted as literal in my list right ?
time rak --ignore-case --type=regex --patterns-from=/tmp/file.txt /path/file.csv

@lizmat
Copy link
Owner

lizmat commented Nov 25, 2022

With a literal regex I mean / foo / where "foo" consists of only alphanumeric characters. I doubt that if you don't change the file, and then do a --type=regex that it would produce the right results, or even run at all??

@Zer0-Tolerance
Copy link
Author

my patterns consist of IP addresses , filenames.

@lizmat
Copy link
Owner

lizmat commented Nov 25, 2022

Could you give me an example of the IP address and a filename?

@Zer0-Tolerance
Copy link
Author

sure "8.8.8.8" and "test.exe" for example.

@lizmat
Copy link
Owner

lizmat commented Nov 25, 2022

When you have 8.8.8.8 in your regex, you actually mean the string "8.8.8.8", right? But "8a8b8c8" you would not want to match, right?

If that is so, why are you using regexes? Why not just the string "8.8.8.8" (which would do a much cheaper '.contains("8.8.8.8")' on the targets.

@Zer0-Tolerance
Copy link
Author

yes but I have also tests?.exe in the same list of pattern.

@lizmat
Copy link
Owner

lizmat commented Nov 27, 2022

ok, then I suggest grouping the ones with wildcards together, and see how that affects the performance. As the logic now will group together any consecutive regexes into a single regex

@lizmat
Copy link
Owner

lizmat commented Jul 29, 2024

FWIW, I'm keeping this one open until it's time to rewrite the expression parsing using RakuAST, and optimize from there

@lizmat
Copy link
Owner

lizmat commented Aug 17, 2024

This has now happened with 0.3.4 https://raku.land/zef:lizmat/App::Rak/changes?v=0.3.4 . This now builds a single Callable for all of the patterns found using RakuAST, not doing anything special for regexes yet.

Please confirm whether this makes things better or worse in your use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants