Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null byte \x00 issue in byte level fsm resulting in KeyError in BetterFSM::FSMInfo #930

Merged
merged 8 commits into from
Jun 5, 2024

Conversation

lapp0
Copy link
Contributor

@lapp0 lapp0 commented May 30, 2024

Follow-up to #904 by @M0gician -- Closes #904

Fixes #833

Problem 1

@M0gician identified a bug in numba's handling of UnicodeCharSeq. if the elements first item is \x00, the entire element is null. This numba bug resulted in a KeyError while compiling a patterns index for certain characters.

More problem details available in #904

Solution 1

@M0gician implemented the base solution - represent token bytes with numba.unicode_type to avoid this bug.

4a5ef55

Problem 2

Use of unicode_type results in ambiguous representation of hex-bytes / chars.

E.g. in main a UnicodeCharSeq will keep the hex bytes and characters separate["😇", "9", "F", "9F"]

Because we use unicode_type instead a List[UnicodeCharSeq(2)], the example sequence would be represented as

["😇", "9", "F", "9F"] -> '😇9F9F'

This demonstrates a problem, we have no idea whether consecutive hex characters represent a byte or two separate utf-8 characters.

Solution 2

To resolve this, we prefix a hex-byte with a null byte.

["😇", "9", "F", "9F"] -> '😇9F\x009F'.

83c4d3a

Problem 3

Parsing a token to separate bytes and chars during _walk_fsm is inefficient and resulted in index compilation taking ~2.2x as long as main

Solution 3

Instead of iterating over a string in _walk_fsm, we precompute a mapping of token -> transition key seq via get_tokens_transition_keys() and iterate over list of integer transition keys for each token in _walk_fsm()

56ea957

Now instead of taking ~2.2x as long as main, index compilation takes ~0.65x as long as main. However Numba takes 16% longer to compile.

Benchmark Suite Results:

Benchmarks that have improved:

Before [0b4d12b] After [fe5cb3d] Ratio Benchmark (Parameter)
- 8.03±0.1s 5.29±0.02s 0.66 JsonSchemaBenchmark.time_json_schema_to_fsm('complex_schema')
- 5.86±0.06s 3.72±0.07s 0.64 JsonSchemaBenchmark.time_json_schema_to_fsm('simple_schema')
- 4.41±0.07s 2.77±0.01s 0.63 RegexGuideBenchmark.time_regex_to_guide('complex_phone')
- 9.57±0.01s 6.37±0.02s 0.67 RegexGuideBenchmark.time_regex_to_guide('complex_span_constrained_relation_extraction')
- 4.20±0.02s 2.63±0.01s 0.63 RegexGuideBenchmark.time_regex_to_guide('date')
- 4.12±0.04s 2.56±0.01s 0.62 RegexGuideBenchmark.time_regex_to_guide('email')
- 4.07±0.04s 2.54±0.01s 0.62 RegexGuideBenchmark.time_regex_to_guide('ip')
- 3.97±0.08s 2.48±0.03s 0.62 RegexGuideBenchmark.time_regex_to_guide('simple_phone')
- 3.89±0.02s 2.41±0.02s 0.62 RegexGuideBenchmark.time_regex_to_guide('ssn')
- 3.96±0.03s 2.49±0.01s 0.63 RegexGuideBenchmark.time_regex_to_guide('time')
- 4.22±0.04s 2.67±0.01s 0.63 RegexGuideBenchmark.time_regex_to_guide('url')

Benchmarks that have stayed the same:

Before [0b4d12b] After [fe5cb3d] Ratio Benchmark (Parameter)
88.4±1μs 86.5±1μs 0.98 bench_json_schema.JsonSchemaBenchmark.time_json_schema_to_regex('complex_schema')
49.2±0.4μs 48.6±0.3μs 0.99 JsonSchemaBenchmark.time_json_schema_to_regex('simple_schema')
598M 600M 1 MemoryRegexGuideBenchmark.peakmem_regex_to_guide('complex_span_constrained_relation_extraction')
496M 499M 1.01 MemoryRegexGuideBenchmark.peakmem_regex_to_guide('simple_phone')

Benchmarks that have got worse:

Before [0b4d12b] After [fe5cb3d] Ratio Benchmark (Parameter)
+ 4.98±0.05s 5.77±0.03s 1.16 NumbaCompileBenchmark.time_compile_numba

Performance degradation detected!

@lapp0 lapp0 marked this pull request as draft May 30, 2024 23:32
@lapp0 lapp0 force-pushed the solve-833 branch 3 times, most recently from 8ce451e to da2608d Compare May 31, 2024 16:52
@M0gician
Copy link
Contributor

Should we add some test cases like the one in #833 to verify if this \x00 issue is solved in later updates as well? I am a little concerned whether numba's bug fix will break the current implementation in the future.

@M0gician
Copy link
Contributor

M0gician commented Jun 1, 2024 via email

@lapp0
Copy link
Contributor Author

lapp0 commented Jun 1, 2024

I have introduced two new tests to verify the behavior:

def test_numba_leading_null_byte_UnicodeCharSeq_remains_broken()

This test asserts that numba/numba#9542 is still a problem. If this test fails we can safely use UnicodeCharSeq to represent byte fsms.

UnicodeCharSeq allows for a cleaner representation of byte fsms than unicode_type, so its preferable that we use it, but this test failing doesn't necessarily indicate anything is broken.

def test_numba_leading_null_byte_unicode_type_sane()

This test asserts that the unicode_type representation of null bytes (introduced in this PR) is legal.

If this fails, then we should consider reverting to use of UnicodeCharSeq (assuming the previous test also fails, indicating UnicodeCharSeq is viable).

@@ -9,7 +9,14 @@
from outlines.fsm.parsing import PartialLark, PartialPythonIndenter


def test_partial_parsing():
@pytest.fixture
def cleanup_lark_import():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't directly related to the PR, but fixes an issue - if the tests in test_parsing.py fail, it cascades and causes a large number of unrelated tests to fail.

@lapp0 lapp0 marked this pull request as ready for review June 4, 2024 22:32
@rlouf rlouf merged commit ed44a47 into dottxt-ai:main Jun 5, 2024
7 checks passed
@rlouf rlouf added structured generation Linked to structured generation numba bug labels Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug numba structured generation Linked to structured generation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyError in BetterFSM::FSMInfo when input FSM alphabet contains UTF-8 characters that ends with \xb8\x80
3 participants