Fix null byte #21

Open: wants to merge 9 commits into base: main

Conversation

@lapp0
Owner

lapp0 commented May 26, 2024

github-actions bot commented May 26, 2024

Benchmark Suite Results:

| Before [592c884] | After [78f0974] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 4.89±0.04s | 4.38±0.03s | ~0.90 | bench_numba_compile.NumbaCompileBenchmark.time_compile_numba |
| 292±3ms | 256±2ms | ~0.87 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('email') |
| 252±7ms | 212±1ms | ~0.84 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('ip') |
| 423±7ms | 349±1ms | ~0.83 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('url') |
| 361±10ms | 289±1ms | ~0.80 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('date') |
| 641±10ms | 449±2ms | ~0.70 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('complex_phone') |
| 4.24±0.06s | 2.91±0.01s | ~0.69 | bench_json_schema.JsonSchemaBenchmark.time_json_schema_to_fsm('complex_schema') |
| 5.79±0.1s | 3.98±0.01s | ~0.69 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('complex_span_constrained_relation_extraction') |
| 2.08±0.01s | 1.36±0.02s | ~0.65 | bench_json_schema.JsonSchemaBenchmark.time_json_schema_to_fsm('simple_schema') |
| 127±3ms | 134±0.7ms | 1.06 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('time') |
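
The Ratio column appears to be After divided by Before, so values below 1 indicate a speedup on this branch. A quick check against a few rows, using only the mean timings from the results above (the `ratio` helper is a hypothetical name for illustration, and the ± spreads are ignored):

```python
def ratio(before: float, after: float) -> float:
    """Return after / before; values below 1.0 mean the 'after' run was faster."""
    return after / before


# (before, after) mean timings copied from the benchmark results, units matching per row.
rows = {
    "time_compile_numba": (4.89, 4.38),
    "time_json_schema_to_fsm('simple_schema')": (2.08, 1.36),
    "time_regex_to_guide('time')": (127.0, 134.0),
}

if __name__ == "__main__":
    for name, (before, after) in rows.items():
        print(f"{name}: {ratio(before, after):.2f}")
```

Only the 'time' benchmark comes out above 1, which matches the single row without a leading `~` in the results.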

@lapp0
Owner Author

lapp0 commented May 26, 2024

Profile of this branch:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.009    0.009   22.921   22.921 /home/andrew/p/outlines/profile_null_byte_fix.py:19(profile_email_guide)
        1    0.000    0.000   22.912   22.912 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   22.912   22.912 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.000    0.000   22.829   22.829 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.015    0.015   21.851   21.851 /home/andrew/p/outlines/outlines/fsm/regex.py:853(create_fsm_index_tokenizer)
        1    0.411    0.411   21.762   21.762 /home/andrew/p/outlines/outlines/fsm/regex.py:709(create_fsm_index_end_to_end)
      389   21.128    0.054   21.130    0.054 /home/andrew/p/outlines/outlines/fsm/regex.py:672(state_scan_tokens)
      523    0.228    0.000    0.846    0.002 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1023(crawl)
     10/1    0.001    0.000    0.838    0.838 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
    128/2    0.001    0.000    0.745    0.372 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.003    0.000    0.745    0.745 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    0.681    0.681 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.355    0.003 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:364(concatenate)
       10    0.000    0.000    0.288    0.029 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:505(union)
       10    0.000    0.000    0.288    0.029 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:967(parallel)
   114135    0.190    0.000    0.190    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:979(follow)
   198170    0.159    0.000    0.176    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:401(follow)
       13    0.000    0.000    0.123    0.009 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:418(__add__)
        1    0.000    0.000    0.116    0.116 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        1    0.000    0.000    0.116    0.116 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:253(reduce_brzozowski)
        2    0.004    0.002    0.115    0.058 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:612(reversed)
  1153581    0.104    0.000    0.104    0.000 {method 'add' of 'set' objects}
      125    0.000    0.000    0.102    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:428(star)

Profile of main:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007   10.290   10.290 /home/andrew/p/outlines/profile_null_byte_fix.py:19(profile_email_guide)
        1    0.000    0.000   10.283   10.283 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   10.283   10.283 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.000    0.000   10.213   10.213 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.014    0.014    9.228    9.228 /home/andrew/p/outlines/outlines/fsm/regex.py:829(create_fsm_index_tokenizer)
        1    0.355    0.355    9.160    9.160 /home/andrew/p/outlines/outlines/fsm/regex.py:684(create_fsm_index_end_to_end)
      389    8.644    0.022    8.646    0.022 /home/andrew/p/outlines/outlines/fsm/regex.py:651(state_scan_tokens)
     10/1    0.001    0.000    0.865    0.865 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
      523    0.197    0.000    0.733    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1023(crawl)
    128/2    0.001    0.000    0.642    0.321 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.002    0.000    0.641    0.641 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    0.590    0.590 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.301    0.002 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:364(concatenate)
       10    0.000    0.000    0.258    0.026 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:505(union)
       10    0.000    0.000    0.258    0.026 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:967(parallel)
      269    0.010    0.000    0.197    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:112(union)
    801/1    0.001    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:69(get_alphabet)
     10/1    0.000    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:423(_get_alphabet)
    128/2    0.000    0.000    0.169    0.085 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:425(<genexpr>)
    118/1    0.000    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:330(_get_alphabet)
    787/2    0.000    0.000    0.169    0.085 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:331(<genexpr>)
     13/1    0.000    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:270(_get_alphabet)
   114135    0.169    0.000    0.169    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:979(follow)
   198170    0.135    0.000    0.148    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:401(follow)
      269    0.143    0.001    0.143    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:118(<dictcomp>)
       13    0.000    0.000    0.104    0.008 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:418(__add__)
        1    0.000    0.000    0.100    0.100 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        1    0.000    0.000    0.100    0.100 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:253(reduce_brzozowski)
        2    0.003    0.002    0.100    0.050 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:612(reversed)

Profile script:

import cProfile
import pstats
from io import StringIO

from outlines.models.transformers import TransformerTokenizer
from transformers import AutoTokenizer
from outlines.fsm.guide import RegexGuide


tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = TransformerTokenizer(tokenizer)


# Warm up: build a trivial guide first so Numba JIT compilation is not included in the profile
RegexGuide("a", tokenizer)

pattern = "(['\"\\ ,]?((?:of|resulting|case|which|cultures|a|core|extreme|selflessness|spiritual|various|However|both|vary|in|other|secular|the|religious|among|moral|and|It|object|worldviews|altruism|traditional|material|aspect|or|life|beings|virtue|is|however|opposite|concern|an|practice|it|for|s|quality|religions|In|Altruism|animals|happiness|many|become|principle|human|selfishness|may|synonym)['\"\\ ,]?)+['\"\\ ,]?\\s\\|\\s([^|\\(\\)\n]{1,})\\s\\|\\s['\"\\ ,]?((?:of|resulting|case|which|cultures|a|core|extreme|selflessness|spiritual|various|However|both|vary|in|other|secular|the|religious|among|moral|and|It|object|worldviews|altruism|traditional|material|aspect|or|life|beings|virtue|is|however|opposite|concern|an|practice|it|for|s|quality|religions|In|Altruism|animals|happiness|many|become|principle|human|selfishness|may|synonym)['\"\\ ,]?)+['\"\\ ,]?(\\s\\|\\s\\(([^|\\(\\)\n]{1,})\\s\\|\\s([^|\\(\\)\n]{1,})\\))*\\n)*"

def profile_email_guide():
    RegexGuide(pattern, tokenizer)


# Create a profiler
profiler = cProfile.Profile()

# Run the code with the profiler
profiler.enable()
profile_email_guide()
profiler.disable()

# Create a stream to hold profiling statistics
s = StringIO()
sortby = 'cumulative'
ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)

# Print the profiling statistics
ps.print_stats()

# Display the profiling statistics
print(s.getvalue())

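The culprit is my modified implementation of _walk_fsm, which is called from state_scan_tokens: the total time of state_scan_tokens rises from 8.644 s on main to 21.128 s on this branch. I will investigate how to optimize it.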
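
To check a suspected hot spot without reading the whole table, a single function's tottime can be pulled out of a profile programmatically. This is a minimal sketch, assuming Python 3.9+ (for `pstats.Stats.get_stats_profile`); `tottime_of`, `inner_loop`, and `workload` are hypothetical names for illustration and are not part of the profile script above:

```python
import cProfile
import pstats


def tottime_of(func, target_name):
    """Run func under cProfile and return the tottime of target_name, or None if absent."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    # func_profiles is keyed by bare function name, so same-named functions
    # (such as the several `follow` entries in the profiles above) would collide.
    profile = pstats.Stats(profiler).get_stats_profile()
    entry = profile.func_profiles.get(target_name)
    return None if entry is None else entry.tottime


def inner_loop(n):
    """Toy stand-in for an expensive inner function."""
    return sum(i * i for i in range(n))


def workload():
    """Toy stand-in for the code being profiled."""
    for _ in range(50):
        inner_loop(1000)


if __name__ == "__main__":
    print(tottime_of(workload, "inner_loop"))
```

Running the same helper on both branches gives two numbers to compare directly.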

@lapp0
Owner Author

lapp0 commented May 27, 2024

I added an index that converts each token into a sequence of transition keys, and now it's slightly faster than main!
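
The idea can be sketched roughly as follows: instead of mapping every character of every token to its transition key each time a state is scanned, the mapping is done once per token up front, and the per-state scan only walks the precomputed key sequences. This is an illustration under assumptions, not the code in outlines/fsm/regex.py; `build_token_transition_keys`, `walk_keys`, and the toy FSM are hypothetical, and the real implementation is Numba-compiled rather than plain Python.

```python
from typing import Dict, List, Optional, Tuple


def build_token_transition_keys(
    vocabulary: List[str],
    alphabet_symbol_mapping: Dict[str, int],
    alphabet_anything_value: int,
) -> Dict[str, List[int]]:
    """Precompute, once, the sequence of transition keys for every token."""
    return {
        token: [alphabet_symbol_mapping.get(ch, alphabet_anything_value) for ch in token]
        for token in vocabulary
    }


def walk_keys(
    transitions: Dict[Tuple[int, int], int],
    start_state: int,
    trans_keys: List[int],
) -> Optional[int]:
    """Follow precomputed keys from start_state; return the end state or None if stuck."""
    state = start_state
    for key in trans_keys:
        next_state = transitions.get((state, key))
        if next_state is None:
            return None
        state = next_state
    return state


# Toy FSM for the regex "ab+": state 0 --a--> 1 --b--> 2 --b--> 2
symbol_mapping = {"a": 0, "b": 1}
anything_value = 2  # key used for any character outside the mapping
toy_transitions = {(0, 0): 1, (1, 1): 2, (2, 1): 2}

token_keys = build_token_transition_keys(
    ["a", "ab", "abb", "ba", "ax"], symbol_mapping, anything_value
)

if __name__ == "__main__":
    for token, keys in token_keys.items():
        print(token, keys, walk_keys(toy_transitions, 0, keys))
```

Because the key sequences do not depend on the FSM state, they are computed once and reused for every state that is scanned.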

New profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015   10.837   10.837 /home/andrew/p/outlines/profile_null_byte_fix.py:24(profile_email_guide)
        1    0.000    0.000   10.822   10.822 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   10.822   10.822 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.001    0.001   10.821   10.821 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.023    0.023    9.681    9.681 /home/andrew/p/outlines/outlines/fsm/regex.py:877(create_fsm_index_tokenizer)
        1    0.488    0.488    9.556    9.556 /home/andrew/p/outlines/outlines/fsm/regex.py:726(create_fsm_index_end_to_end)
      389    8.745    0.022    8.747    0.022 /home/andrew/p/outlines/outlines/fsm/regex.py:659(state_scan_tokens)
      523    0.259    0.000    0.976    0.002 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1023(crawl)
     10/1    0.002    0.000    0.972    0.972 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
    128/2    0.001    0.000    0.859    0.429 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.003    0.000    0.859    0.859 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    0.791    0.791 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.400    0.003 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:364(concatenate)
       10    0.000    0.000    0.352    0.035 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:505(union)
       10    0.001    0.000    0.352    0.035 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:967(parallel)
   114135    0.234    0.000    0.234    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:979(follow)
   198170    0.178    0.000    0.197    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:401(follow)
        1    0.000    0.000    0.138    0.138 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        1    0.001    0.001    0.138    0.138 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:253(reduce_brzozowski)
       13    0.000    0.000    0.137    0.011 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:418(__add__)
        2    0.005    0.003    0.137    0.069 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:612(reversed)
  1153581    0.122    0.000    0.122    0.000 {method 'add' of 'set' objects}
    24185    0.082    0.000    0.114    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:634(follow)
      125    0.000    0.000    0.109    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:428(star)
        1    0.101    0.101    0.101    0.101 /home/andrew/p/outlines/outlines/fsm/regex.py:695(get_tokens_trans_keys)
      269    0.016    0.000    0.081    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:112(union)
    67779    0.051    0.000    0.080    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1031(get_hash)
        1    0.060    0.060    0.060    0.060 /home/andrew/p/outlines/outlines/fsm/regex.py:903(<dictcomp>)
   743602    0.055    0.000    0.055    0.000 {method 'setdefault' of 'dict' objects}
      269    0.011    0.000    0.049    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:114(<dictcomp>)
    65660    0.047    0.000    0.048    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:438(follow)
    801/1    0.002    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:69(get_alphabet)
     10/1    0.000    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:423(_get_alphabet)
    128/2    0.000    0.000    0.042    0.021 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:425(<genexpr>)
    118/1    0.001    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:330(_get_alphabet)
    787/2    0.000    0.000    0.042    0.021 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:331(<genexpr>)
     13/1    0.000    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:270(_get_alphabet)
      255    0.001    0.000    0.039    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:463(times)
    57272    0.016    0.000    0.039    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:114(<genexpr>)
      359    0.004    0.000    0.035    0.000 /nix/store/qd7h3vn2bff6jjigdvq0xh91q49sm1ng-python3.11-tqdm-4.66.4/lib/python3.11/site-packages/tqdm/std.py:1198(update)
       69    0.001    0.000    0.030    0.000 /nix/store/qd7h3vn2bff6jjigdvq0xh91q49sm1ng-python3.11-tqdm-4.66.4/lib/python3.11/site-packages/tqdm/std.py:1325(refresh)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/outlines/models/transformers.py:113(__hash__)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/datasets/fingerprint.py:226(hash)

lapp0 force-pushed the fix-null-byte branch 2 times, most recently from 53d8a8d to 22cbed6 on May 27, 2024 06:57
@@ -419,17 +416,17 @@ def _walk_fsm(
alphabet_anything_value: int,
Owner Author

alphabet_anything_value doesn't need to be passed.
