Fix null byte \x00 issue by switching to numba.types.unicode_type #904
@M0gician please let me know if you would like any help on the tests.
Edit:
Problem
Looking further into it, the core issue is that hex representations in before it was an iterable of length 1. So when we compare to Looking into a clean way to fix this.
Solution Investigation
Perhaps we can create a format where the byte (
Then we can treat null-prefixed characters as bytes, and everything else as characters. No major tokenizer uses tokens whose characters contain null bytes (they all follow the UTF-8 standard), so there wouldn't be any collisions.
Script:
Output:
I am busy this week, so my bandwidth is limited. I tried stepping into the exceptions but have not found anything decisive yet. If you have some time to check the tests, that would definitely help.
Could you please review lapp0@8e168c6? It includes the new token/symbol representation format introduced in my comment above. Previously we had an array of UTF-8 symbols and hex codes, and we distinguished them by the number of characters in the symbol: one character meant a UTF-8 character, two meant the hex representation of a byte.
Because we use
This demonstrates a problem: we have no way to tell whether two consecutive hex characters represent a single byte or two separate UTF-8 characters. To resolve this, we prefix each hex byte with a null byte.
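To make the ambiguity and the null-prefix fix concrete, here is a minimal sketch in plain Python. It is not the code from the linked commit: the function names `encode_symbols` and `decode_symbols` are hypothetical, and it assumes that symbols are flattened into a single string, with raw bytes modelled as `int` values and written as two uppercase hex digits.

```python
from __future__ import annotations

NULL = "\x00"  # prefix marking "the next two characters are a hex byte"


def encode_symbols(symbols: list[str | int]) -> str:
    """Flatten symbols into one string (hypothetical helper).

    A str is a single character and is kept as-is; an int is a raw byte
    and is written as NULL followed by two uppercase hex digits.
    """
    parts = []
    for sym in symbols:
        if isinstance(sym, int):
            parts.append(f"{NULL}{sym:02X}")
        else:
            parts.append(sym)
    return "".join(parts)


def decode_symbols(flat: str) -> list[str | int]:
    """Invert encode_symbols: null-prefixed pairs become bytes again."""
    out: list[str | int] = []
    i = 0
    while i < len(flat):
        if flat[i] == NULL:
            out.append(int(flat[i + 1 : i + 3], 16))  # two hex digits
            i += 3
        else:
            out.append(flat[i])
            i += 1
    return out


# Old format: the byte 0xE2 written as "E2" is indistinguishable from
# the two characters "E" and "2" once the symbols are concatenated.
assert "".join(["E2"]) == "".join(["E", "2"])

# New format: the null prefix keeps the two cases apart.
assert encode_symbols([0xE2]) != encode_symbols(["E", "2"])
```

Decoding relies on the point made earlier in the thread: tokens are not expected to contain a literal null character, so the prefix cannot collide with real text.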
This allows us to avoid the issues with numba
Processing tokens symbol-by-symbol is inefficient, especially when you're applying conditional handling within
This final change improved runtime from ~210% of
TODO:
Fixes #833
Changes
Switch to numba.types.unicode_type in the outlines.fsm.regex implementation to avoid a bug in numba.unicodecharseq, whose boxer has an invalid skip on null bytes that causes the copy to end prematurely.
Test
I've used this patch for about a month. So far everything is working well without any problem.