Use a trie for scanning during index construction #507
Conversation
We need profiling measurements to see which kinds of improvements the Numba changes introduce.
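(For context, a minimal profiling sketch; `build_regex_index` below is a hypothetical stand-in for the library's actual index-construction routine, not its API.)

```python
import cProfile
import pstats
import re

def build_regex_index(pattern: str):
    # Hypothetical stand-in for the index construction being profiled.
    return re.compile(pattern)

with cProfile.Profile() as profiler:
    build_regex_index(r"\d{3}-\d{3}-\d{4}")

# Sort by cumulative time to see which helpers dominate construction.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```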
Great! We would need to profile the code.
Feel free to open a separate issue and/or PR for that idea.
Thanks for the results! Can you determine the performance gains for each change (i.e. the interegular change, the numbafication, and the token trie separately)?
Please see the updated description.
Use of a token trie reduced the number of `_walk_fsm` calls; previously it was called 449,307 times (the vocab size is 49,923). I updated the description with these benchmarks. A trie is beneficial except when the regex accepts many tokens in many states.
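(For readers unfamiliar with the approach, here is a minimal sketch of the idea, not the PR's actual implementation; `build_trie` and `scan_state` are hypothetical names. The vocabulary is indexed as a character trie and the FSM is walked along trie edges, so shared prefixes are walked once and whole subtrees are pruned as soon as a prefix is rejected.)

```python
def build_trie(vocabulary):
    """Index the token vocabulary as a nested-dict character trie."""
    root = {}
    for token in vocabulary:
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node[""] = token  # an empty key marks a complete token
    return root

def scan_state(transitions, state, trie, accepted):
    """Collect (token, end_state) pairs reachable from `state` via the trie."""
    for char, child in trie.items():
        if char == "":
            accepted.append((child, state))  # `child` is a complete token here
            continue
        next_state = transitions.get((state, char))
        if next_state is None:
            continue  # prune every token sharing this rejected prefix
        scan_state(transitions, next_state, child, accepted)

# Toy usage: an FSM over states 0 -> 1 -> 2 that accepts "ab" and "ac".
transitions = {(0, "a"): 1, (1, "b"): 2, (1, "c"): 2}
accepted = []
scan_state(transitions, 0, build_trie(["ab", "ac", "ad", "xyz"]), accepted)
print(accepted)  # [('ab', 2), ('ac', 2)]; "ad" and "xyz" are pruned early
```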
Before I thoroughly review the PR: I think the best way forward here is to separate the trie and the Numba update to make conversations and evaluation easier. We might also need to add a benchmarking suite to track these changes over time in the library.
Done
Done
Ran "first-second-run" experiment twice on each branch, totalling 8 runs of the script above. PR:
Master:
I'm afraid we cannot merge something that makes first-time compilation slower. We can always make subsequent calls faster by caching the index construction, as is now the case on the main branch. As above, I suggest breaking this PR down into two PRs: one for the Numba change and one for the trie, so we can evaluate them independently. Unless you know which change is responsible for the slowdown, in which case I would advise reverting it and benchmarking with the other one.
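(A rough sketch of what caching buys here: memoize construction on the pattern so only the first call in a process pays the cost. `build_index` below is a hypothetical stand-in, not the library's actual function.)

```python
from functools import lru_cache
import time

def build_index(pattern: str):
    # Hypothetical stand-in for the expensive FSM scan / index construction.
    time.sleep(0.5)
    return {"pattern": pattern, "states_to_token_maps": {}}

@lru_cache(maxsize=None)
def cached_build_index(pattern: str):
    # The first call for a given pattern is slow; repeat calls are near-instant.
    return build_index(pattern)

cached_build_index(r"\d{4}-\d{2}-\d{2}")  # slow: builds the index
cached_build_index(r"\d{4}-\d{2}-\d{2}")  # fast: served from the cache
```

Note that a cache of this kind only speeds up calls after the first one; it does not address the first-run compilation slowdown discussed above.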
Any thoughts on AOT compilation? The only downside I see is that, if distributed via Docker, it would bind the library to a specific architecture.
Let's split the PR before discussing approach-specific optimization.
A long-term viable AOT compilation approach in Numba isn't very clear right now. They're in the process of deprecating the current approach, and I believe the replacement is going to be PIXIE.
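(For context, the AOT approach currently being deprecated is `numba.pycc`, which compiles exported functions into a native extension module at build time; this is also where the architecture-binding concern comes from. A rough sketch with a placeholder kernel, not the project's actual `_walk_fsm`:)

```python
from numba.pycc import CC

cc = CC("fsm_aot")  # name of the compiled extension module

@cc.export("count_transitions", "int64(int64[:])")
def count_transitions(transitions):
    # Placeholder numeric kernel standing in for the real FSM-walking code.
    total = 0
    for t in transitions:
        total += t
    return total

if __name__ == "__main__":
    # Produces a compiled extension (e.g. fsm_aot.cpython-*.so) built for
    # this machine's architecture only.
    cc.compile()
```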
I'm going to create a PR with some benchmarks.
The benchmarks from #542 indicate this would result in compilation times ranging from a 30% increase to a 70% decrease, decreasing the time on average. It's especially beneficial for complex FSMs with many states. I can't rebase at the moment, but I'm happy to review a rebase; please ping me on the PR.
Closing in favor of #887 |
Update `state_scan_tokens` to use a trie. We are performing an expensive `_walk_fsm` call for every token × every state, which is the current bottleneck. This reduces the number of calls to `_walk_fsm`.

Performance

The author of interegular pushed my interegular `FSM.reduce()` optimization to PyPI in version `0.3.3`. (For simple regexes, interegular's `reduce()` doesn't take much time, so the difference between 0.3.3 and 0.3.2 is mostly variation in the runtime of the rest of `RegexFSM`.)

The benchmarked configurations are:

- interegular change
- interegular change + PR's numbafication
- interegular change + PR's numbafication + PR's token trie

Benchmarks indicate all `RegexFSM` constructions are faster than before. The worst sample is 20% faster, the best sample is 700% faster.
Benchmarks
email:
complex_phone:
simple_phone:
date:
time:
ip:
url:
ssn:
very_complex:
json_schema:
complex_json_schema:
Benchmark code:
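(The benchmark script itself is not reproduced in this excerpt. As a rough illustration of the kind of harness involved, the sketch below times index construction per pattern; `build_index` and the patterns are illustrative stand-ins, not the ones actually benchmarked.)

```python
import re
import statistics
import time

def build_index(pattern: str):
    # Illustrative stand-in for the RegexFSM / index construction under test.
    return re.compile(pattern)

PATTERNS = {
    "simple_phone": r"\d{3}-\d{3}-\d{4}",
    "date": r"\d{4}-\d{2}-\d{2}",
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
}

def bench(pattern: str, repeats: int = 5) -> float:
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        build_index(pattern)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

for name, pattern in PATTERNS.items():
    print(f"{name}: {bench(pattern) * 1e6:.1f} µs")
```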