Fix null byte `\x00` issue in byte level fsm resulting in `KeyError` in `BetterFSM::FSMInfo` #930

lapp0 · 2024-05-30T22:54:41Z

Follow-up to #904 by @M0gician -- Closes #904

Fixes #833

Problem 1

@M0gician identified a bug in numba's handling of UnicodeCharSeq: if an element's first character is \x00, the entire element is treated as null. This numba bug caused a KeyError while compiling a pattern's index for certain characters.

Solution 1

@M0gician implemented the base solution - represent token bytes with numba.unicode_type to avoid this bug.

4a5ef55

Problem 2

Use of unicode_type results in ambiguous representation of hex-bytes / chars.

E.g. in main, a UnicodeCharSeq keeps the hex bytes and characters separate: ["😇", "9", "F", "9F"]

Because we use unicode_type instead of a List[UnicodeCharSeq(2)], the example sequence would be represented as

["😇", "9", "F", "9F"] -> '😇9F9F'

This demonstrates the problem: we cannot tell whether consecutive hex characters represent a single byte or two separate UTF-8 characters.
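The ambiguity can be reproduced with plain Python strings. This is only an illustration: ordinary `str` values stand in for numba's unicode_type, and the variable names are made up for the example.

```python
# Three different symbol sequences (hypothetical examples).
# A two-character entry such as "9F" stands for a hex byte;
# a one-character entry is an ordinary character.
symbols_a = ["😇", "9", "F", "9F"]      # chars "9", "F", then byte 9F
symbols_b = ["😇", "9F", "9F"]          # two hex bytes
symbols_c = ["😇", "9", "F", "9", "F"]  # four separate characters

# Flattening into a single string discards the symbol boundaries.
flat_a = "".join(symbols_a)
flat_b = "".join(symbols_b)
flat_c = "".join(symbols_c)

# All three collapse to the same flat string, so the original
# sequence cannot be recovered from it.
all_equal = flat_a == flat_b == flat_c
print(all_equal)
```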

Solution 2

To resolve this, we prefix a hex-byte with a null byte.

["😇", "9", "F", "9F"] -> '😇9F\x009F'.

83c4d3a
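A minimal sketch of the null-prefix idea in pure Python, not the code from the commit. The helper names `encode_symbols` and `decode_symbols` are hypothetical, and the sketch assumes that every two-character symbol is a hex byte and that a bare `\x00` never appears as a symbol of its own.

```python
NULL = "\x00"

def encode_symbols(symbols):
    """Flatten symbols into one string, prefixing each hex byte with a null."""
    parts = []
    for sym in symbols:
        # Assumption: a two-character symbol is a hex byte such as "9F".
        parts.append(NULL + sym if len(sym) == 2 else sym)
    return "".join(parts)

def decode_symbols(flat):
    """Recover the symbol list from a null-prefixed flat string."""
    symbols = []
    i = 0
    while i < len(flat):
        if flat[i] == NULL:
            # The two characters after the null form one hex byte.
            symbols.append(flat[i + 1:i + 3])
            i += 3
        else:
            symbols.append(flat[i])
            i += 1
    return symbols

encoded = encode_symbols(["😇", "9", "F", "9F"])
decoded = decode_symbols(encoded)
```

With the prefix in place, the sequences that previously collided now encode to different strings, and decoding returns the original list.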

Problem 3

Parsing each token to separate bytes from chars during _walk_fsm is inefficient; it made index compilation take ~2.2x as long as on main.

Solution 3

Instead of iterating over a string in _walk_fsm, we precompute a mapping of token -> transition key sequence via get_tokens_transition_keys(), then iterate over the list of integer transition keys for each token in _walk_fsm().

56ea957
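The precomputation can be sketched in pure Python as follows. This is an illustration under simplifying assumptions, not the numba implementation from the commit: `to_transition_keys`, `walk`, the toy alphabet and the transition table are all hypothetical.

```python
from typing import Dict, List, Tuple

def to_transition_keys(symbols: List[str], alphabet: Dict[str, int], other_key: int) -> List[int]:
    """Map each symbol of a token to its integer transition key."""
    return [alphabet.get(sym, other_key) for sym in symbols]

def walk(transitions: Dict[Tuple[int, int], int], start: int, keys: List[int]) -> List[int]:
    """Follow integer keys through the FSM; return [] if the token is rejected."""
    state = start
    visited = []
    for key in keys:
        nxt = transitions.get((state, key))
        if nxt is None:
            return []
        state = nxt
        visited.append(state)
    return visited

# Toy FSM accepting "ab": state 0 --a--> 1 --b--> 2.
alphabet = {"a": 0, "b": 1}
OTHER = 2  # key for any symbol outside the alphabet
transitions = {(0, 0): 1, (1, 1): 2}

# Each token is split into symbols and mapped to keys once, up front.
vocabulary = {"ab": ["a", "b"], "a": ["a"], "ba": ["b", "a"]}
vocab_keys = {tok: to_transition_keys(syms, alphabet, OTHER) for tok, syms in vocabulary.items()}

# The walk itself then only touches integers, with no string parsing.
paths = {tok: walk(transitions, 0, keys) for tok, keys in vocab_keys.items()}
```

The design point is that splitting tokens into symbols happens once per vocabulary rather than once per token per FSM state.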

Now, instead of taking ~2.2x as long as main, index compilation takes ~0.65x as long as main. However, Numba takes 16% longer to compile.

Benchmark Suite Results:

Benchmarks that have improved:

	Before [`0b4d12b`]	After [`fe5cb3d`]	Ratio	Benchmark (Parameter)
-	8.03±0.1s	5.29±0.02s	0.66	JsonSchemaBenchmark.time_json_schema_to_fsm('complex_schema')
-	5.86±0.06s	3.72±0.07s	0.64	JsonSchemaBenchmark.time_json_schema_to_fsm('simple_schema')
-	4.41±0.07s	2.77±0.01s	0.63	RegexGuideBenchmark.time_regex_to_guide('complex_phone')
-	9.57±0.01s	6.37±0.02s	0.67	RegexGuideBenchmark.time_regex_to_guide('complex_span_constrained_relation_extraction')
-	4.20±0.02s	2.63±0.01s	0.63	RegexGuideBenchmark.time_regex_to_guide('date')
-	4.12±0.04s	2.56±0.01s	0.62	RegexGuideBenchmark.time_regex_to_guide('email')
-	4.07±0.04s	2.54±0.01s	0.62	RegexGuideBenchmark.time_regex_to_guide('ip')
-	3.97±0.08s	2.48±0.03s	0.62	RegexGuideBenchmark.time_regex_to_guide('simple_phone')
-	3.89±0.02s	2.41±0.02s	0.62	RegexGuideBenchmark.time_regex_to_guide('ssn')
-	3.96±0.03s	2.49±0.01s	0.63	RegexGuideBenchmark.time_regex_to_guide('time')
-	4.22±0.04s	2.67±0.01s	0.63	RegexGuideBenchmark.time_regex_to_guide('url')

Benchmarks that have stayed the same:

Before [`0b4d12b`]	After [`fe5cb3d`]	Ratio	Benchmark (Parameter)
88.4±1μs	86.5±1μs	0.98	bench_json_schema.JsonSchemaBenchmark.time_json_schema_to_regex('complex_schema')
49.2±0.4μs	48.6±0.3μs	0.99	JsonSchemaBenchmark.time_json_schema_to_regex('simple_schema')
598M	600M	1	MemoryRegexGuideBenchmark.peakmem_regex_to_guide('complex_span_constrained_relation_extraction')
496M	499M	1.01	MemoryRegexGuideBenchmark.peakmem_regex_to_guide('simple_phone')

Benchmarks that have got worse:

	Before [`0b4d12b`]	After [`fe5cb3d`]	Ratio	Benchmark (Parameter)
+	4.98±0.05s	5.77±0.03s	1.16	NumbaCompileBenchmark.time_compile_numba

Performance degradation detected!
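For reference, the Ratio column is the After time divided by the Before time. A small sketch recomputing it for three rows copied from the tables above; because the displayed timings are already rounded, other rows may differ from the table in the last digit.

```python
# (before_seconds, after_seconds) copied from the tables above.
rows = {
    "time_json_schema_to_fsm('complex_schema')": (8.03, 5.29),
    "time_regex_to_guide('complex_phone')": (4.41, 2.77),
    "time_compile_numba": (4.98, 5.77),
}

# Ratio = after / before; below 1 is faster, above 1 is slower.
ratios = {name: round(after / before, 2) for name, (before, after) in rows.items()}

for name, ratio in ratios.items():
    print(f"{ratio:.2f}  {name}")
```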

M0gician · 2024-05-31T17:14:14Z

Should we add some test cases like the one in #833 to verify if this \x00 issue is solved in later updates as well? I am a little concerned whether numba's bug fix will break the current implementation in the future.

M0gician · 2024-06-01T08:20:38Z

> You mean a test case to ensure that a unicode_type starting with \x00 isn't broken in numba in the same way it breaks UnicodeCharSeq?

Yes exactly

lapp0 · 2024-06-01T18:10:26Z

I have introduced two new tests to verify the behavior:

`def test_numba_leading_null_byte_UnicodeCharSeq_remains_broken()`

This test asserts that numba/numba#9542 is still a problem. If this test fails we can safely use UnicodeCharSeq to represent byte fsms.

UnicodeCharSeq allows for a cleaner representation of byte fsms than unicode_type, so it's preferable that we use it, but this test failing doesn't necessarily indicate anything is broken.

`def test_numba_leading_null_byte_unicode_type_sane()`

This test asserts that the unicode_type representation of null bytes (introduced in this PR) is legal.

If this fails, then we should consider reverting to use of UnicodeCharSeq (assuming the previous test also fails, indicating UnicodeCharSeq is viable).
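How the two test outcomes combine can be summarized as a small decision function. This is a reading aid only: `recommended_action` is a hypothetical helper, and the branch where UnicodeCharSeq is still broken while unicode_type is not sane is not covered by the description above, so its result is an assumption.

```python
def recommended_action(charseq_still_broken: bool, unicode_type_sane: bool) -> str:
    """Summarize what the two numba behavior tests imply (hypothetical helper)."""
    if charseq_still_broken and unicode_type_sane:
        # Both tests pass: the representation from this PR is needed and valid.
        return "keep unicode_type"
    if not charseq_still_broken and unicode_type_sane:
        # First test fails: nothing is broken, but the cleaner type is usable again.
        return "UnicodeCharSeq is viable; consider switching back"
    if not charseq_still_broken and not unicode_type_sane:
        # Both tests fail: the null-prefix representation is no longer legal.
        return "consider reverting to UnicodeCharSeq"
    # Assumption: this case is not discussed in the comment above.
    return "investigate; neither representation is verified"

outcomes = {
    (broken, sane): recommended_action(broken, sane)
    for broken in (True, False)
    for sane in (True, False)
}
```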

lapp0 · 2024-06-04T22:06:44Z

tests/fsm/test_parsing.py

@@ -9,7 +9,14 @@
 from outlines.fsm.parsing import PartialLark, PartialPythonIndenter


-def test_partial_parsing():
+@pytest.fixture
+def cleanup_lark_import():


This change isn't directly related to the PR, but it fixes an issue: if the tests in test_parsing.py fail, the failure cascades and causes a large number of unrelated tests to fail.

lapp0 marked this pull request as draft May 30, 2024 23:32

lapp0 force-pushed the solve-833 branch 3 times, most recently from 8ce451e to da2608d Compare May 31, 2024 16:52

lapp0 force-pushed the solve-833 branch from da2608d to 135793d Compare June 1, 2024 18:13

lapp0 mentioned this pull request Jun 1, 2024

Use a trie to speed up index construction #887

Closed

M0gician and others added 4 commits June 4, 2024 16:57

Fix null byte \x00 issue by switching to numba.types.unicode_type

222a5d2

ensure byte fsm unicode_type compatibility by prefixing hex-bytes with \x00

5ba4c2b

index token -> transition key sequence for efficient _walk_fsm

22d3ddd

pass token_transition_sequence to walk_fsm in parsing.py

ae9635c

lapp0 force-pushed the solve-833 branch from 135793d to c314cb8 Compare June 4, 2024 21:58

add assertions for numba behavior re: UnicodeCharSeq / unicode_type

a759728

lapp0 force-pushed the solve-833 branch from 0ad30e0 to a759728 Compare June 4, 2024 22:02

Add docstrings to get_token/vocab_transition_keys, improve naming

1e39027

lapp0 marked this pull request as ready for review June 4, 2024 22:32

lapp0 added 2 commits June 4, 2024 17:34

remove cleanup step redundant with cleanup_lark_import

4468180

Improve naming of token and vocab transition keys in regex.py

501520a

rlouf merged commit ed44a47 into dottxt-ai:main Jun 5, 2024
7 checks passed

rlouf added the structured generation, numba, and bug labels Jun 5, 2024

hnyls2002 mentioned this pull request Jun 16, 2024

Fix the Jump-Forward with Chinese sgl-project/sglang#551

Merged
