
Update CFGGuide to use outlines.fsm.parsing. Enable generate.cfg #1067

Merged — 3 commits merged into dottxt-ai:main on Aug 31, 2024

Conversation

@lapp0 (Collaborator) commented Jul 25, 2024

Rendered Docs

Fixes:

Changes

CFGGuide

  • Created a stateless CFGGuide based on @brandonwillard's implementation in examples/parsing.py (usage sketched below)
  • Updated outlines.fsm.parsing to handle some edge cases
    • Implemented accepts() and feed_eof() for termination checking.
    • Bug fix: previously, tokens that exceeded the bounds of the terminal but had no matching subsequent terminal candidate were still marked as valid.
  • Deleted CFGFSM and its tests
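
A minimal usage sketch of the stateless interface (method names follow this PR; the constructor signature and tokenizer wrapper are assumptions):

from outlines.fsm.guide import CFGGuide

# `grammar_str` is a Lark grammar and `tokenizer` an outlines tokenizer
# wrapper; both are assumed to exist in scope.
guide = CFGGuide(cfg_string=grammar_str, tokenizer=tokenizer)

state = guide.initial_state                       # stateless: the caller holds the state
instruction = guide.get_next_instruction(state)   # which tokens are valid next
state = guide.get_next_state(state, token_id=42)  # advance with a sampled token
if guide.must_terminate_state(state):
    print("only EOS is legal from here")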

Grammars

  • Fixed ESCAPED_STRING in json.lark and common.lark

Integrations

  • Implemented outlines.generate.cfg(...) via SequenceGeneratorAdapter (usage example below)
  • Implemented outlines.processors.CFGLogitsProcessor
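
A hedged usage sketch (the model choice is a placeholder, and grammars.arithmetic is assumed to expose the bundled arithmetic grammar):

import outlines
from outlines import grammars

model = outlines.models.transformers("gpt2")
generator = outlines.generate.cfg(model, grammars.arithmetic)
result = generator("Write an arithmetic expression: ")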

Testing

tests/fsm/test_cfg_guide.py

  • test_cfg_next_token: assert that, given a sequence of previously generated tokens, the expected next tokens in the vocabulary are allowed.
  • test_cfg_grammar_sample: given a sample that is valid under a grammar, compute token_ids = tokenizer.encode(sample) and assert that token_ids can be produced by CFGGuide. This allows a new test to be created simply by adding an example to tests/cfg_samples/ (a sketch of the pattern follows this list).
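
A sketch of that sample-driven pattern (the helper below is illustrative, not the actual test code):

def assert_sample_is_producible(guide, tokenizer, sample):
    token_ids = tokenizer.encode(sample)
    state = guide.initial_state
    for token_id in token_ids:
        instruction = guide.get_next_instruction(state)
        # every encoded token must be permitted at its step
        assert token_id in set(instruction.tokens)
        state = guide.get_next_state(state, token_id)
    # after consuming the whole sample, termination must be legal
    assert guide.can_terminate_state(state)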

Test outlines.generate.cfg via tests/generate/test_generate.py

Update tests/fsm/test_guide.py to test for CFGGuide.must_terminate_state() and CFGGuide.can_terminate_state()

Benchmarks

benchmarks/bench_cfg_guide.py: measures CFGGuide construction time, token run time, and token run peak memory
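
The benchmark follows asv conventions (time_* / peakmem_* methods); the sketch below shows its shape, with build_guide, sample_token_ids, and run_through_tokens as hypothetical helpers:

class CFGGuideBenchmark:
    params = ["json", "arithmetic"]
    param_names = ["grammar"]

    def setup(self, grammar):
        self.guide = build_guide(grammar)               # hypothetical helper
        self.token_ids = sample_token_ids(grammar)      # hypothetical helper

    def time_cfg_guide_setup(self, grammar):
        build_guide(grammar)                            # construction time

    def time_cfg_guide_run(self, grammar):
        run_through_tokens(self.guide, self.token_ids)  # token run time

    def peakmem_cfg_guide_run(self, grammar):
        run_through_tokens(self.guide, self.token_ids)  # peak memory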

Analysis

Using the gpt2 tokenizer: regardless of generation length (10, 40, or 100 tokens), it takes ~1.2 seconds to generate each token.

Unsurprisingly, get_next_instruction dominates, accounting for over 99.99% of the runtime. This is intuitive: get_next_state applies the same operation, but to the single sampled token rather than once for each of gpt2's 50,257 tokens.
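
In pseudocode (a sketch only; parse_from_state echoes the profile below, and the vocabulary wiring is illustrative):

from copy import copy

def get_next_instruction(parser_state, vocabulary):
    # trial-parse every vocabulary entry: ~50,257 state copies and lex attempts
    valid_token_ids = []
    for token_id, token_str in vocabulary.items():
        candidate = copy(parser_state)               # the copying cost below
        try:
            parse_from_state(candidate, token_str)   # the lexing cost below
            valid_token_ids.append(token_id)
        except Exception:
            pass
    return valid_token_ids

def get_next_state(parser_state, token_str):
    # identical work, but for the single sampled token only
    parse_from_state(parser_state, token_str)
    return parser_state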

Breakdown:

  • lexing takes ~58% of the time (no copying involved in lexing)
  • copying takes ~26% of the time
  • everything else takes ~16% of the time

cProfile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  140.176  140.176 {built-in method builtins.exec}
        1    0.000    0.000  140.176  140.176 <string>:1(<module>)
        1    0.003    0.003  140.176  140.176 /home/andrew/p/outlines/profile_cfg.py:15(profile_guide_run)
       40    3.736    0.093  140.159    3.504 /home/andrew/p/outlines/outlines/fsm/guide.py:324(get_next_instruction)
  1758994    0.785    0.000   92.318    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:140(parse_from_state)
  1758994    2.055    0.000   91.533    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:482(parse_from_state)
  2917115    2.304    0.000   81.840    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:630(lex)
  2916020    8.630    0.000   79.111    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:696(next_token)
11708177/2913032   11.354    0.000   39.208    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/copy.py:66(copy)
  1759029    2.429    0.000   36.344    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:453(__copy__)
  2329598    1.455    0.000   33.410    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:693(match)
  2329598    8.935    0.000   31.644    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:562(match)
  1759029    1.509    0.000   24.687    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:145(__copy__)
  1555742    2.393    0.000   15.974    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:545(get_terminals_info)
  2312124    1.152    0.000   14.318    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/lexer.py:202(__new__)
  3111484    7.513    0.000   13.219    0.000 /home/andrew/p/outlines/outlines/fsm/regex.py:619(get_sub_fsms_from_seq)
  2312124    1.477    0.000   13.166    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/lexer.py:213(_future_new)
  8160509    1.637    0.000   12.493    0.000 {built-in method __new__ of type object at 0x7fee64db5340}
  1759029    1.347    0.000   12.287    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/lexer.py:427(__copy__)
2309943/1154085    2.419    0.000   10.857    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/dataclasses.py:233(wrapper)
  3518058    5.254    0.000    8.725    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/copy.py:259(_reconstruct)
  2329614    1.714    0.000    8.284    0.000 /nix/store/mrp9s742bpjwv7lb3rv3ikv8qx72nj0d-python3.11-numba-0.59.1/lib/python3.11/site-packages/numba/core/dispatcher.py:724(typeof_pyval)
  1158121    1.975    0.000    7.087    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:362(feed_token)
  2329598    5.019    0.000    6.705    0.000 /home/andrew/p/outlines/outlines/fsm/regex.py:465(walk_fsm)
  2329882    1.531    0.000    6.388    0.000 /nix/store/mrp9s742bpjwv7lb3rv3ikv8qx72nj0d-python3.11-numba-0.59.1/lib/python3.11/site-packages/numba/core/typing/typeof.py:27(typeof)
  2329598    5.240    0.000    6.380    0.000 /home/andrew/p/outlines/outlines/fsm/regex.py:694(get_token_transition_keys)
  3111484    3.793    0.000    5.577    0.000 /home/andrew/p/outlines/outlines/fsm/regex.py:646(<genexpr>)
  1759029    2.521    0.000    5.514    0.000 /home/andrew/p/outlines/outlines/models/transformers.py:96(convert_token_to_string)
  1759029    2.159    0.000    4.904    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/copy.py:128(deepcopy)
14580583/11126799    1.989    0.000    4.371    0.000 {built-in method builtins.isinstance}
2330433/2329882    1.406    0.000    3.828    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/functools.py:904(wrapper)
  1759029    1.323    0.000    2.993    0.000 /nix/store/m7bq08w4hkvyby4s2w04pv6jjh4jk13l-python3.11-transformers-4.41.0/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:618(convert_tokens_to_string)
   602706    1.057    0.000    2.707    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/exceptions.py:179(__init__)
 27668151    2.677    0.000    2.678    0.000 {method 'get' of 'dict' objects}
 12318040    2.217    0.000    2.217    0.000 {built-in method builtins.getattr}
  1726892    0.754    0.000    2.142    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/typing.py:1327(__instancecheck__)
  3518058    2.115    0.000    2.115    0.000 {method '__reduce_ex__' of 'object' objects}
  2330433    1.183    0.000    2.066    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/functools.py:818(dispatch)
  1154003    0.682    0.000    2.058    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/lexer.py:252(new_borrow_pos)
  3518058    1.236    0.000    1.626    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/copyreg.py:104(__newobj__)
   602706    1.090    0.000    1.584    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/exceptions.py:55(get_context)
  1759029    0.974    0.000    1.471    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:349(__init__)
  1759029    1.451    0.000    1.451    0.000 {method 'decode' of 'tokenizers.decoders.Decoder' objects}
  1759029    1.195    0.000    1.435    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/copy.py:243(_keep_alive)
  1726892    1.068    0.000    1.395    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/lark/lexer.py:292(feed)
  1726892    0.946    0.000    1.388    0.000 /nix/store/4rf5qybw37b4lh1g0xczlv14sqdbmnpm-python3-3.11.9/lib/python3.11/typing.py:1602(__subclasscheck__)
  1154085    0.353    0.000    1.280    0.000 /home/andrew/p/outlines/outlines/fsm/parsing.py:867(get_contextual_lexer)
17163298/17163296    1.272    0.000    1.272    0.000 {built-in method builtins.len}
  4659212    1.141    0.000    1.141    0.000 /nix/store/mrp9s742bpjwv7lb3rv3ikv8qx72nj0d-python3.11-numba-0.59.1/lib/python3.11/site-packages/numba/core/serialize.py:30(_numba_unpickle)

TODO

@lapp0 (Collaborator, Author) commented Jul 25, 2024

Please provide comments on these issues. I will create the new issues once they're refined / approved.

New Issues: Direct CFGGuide Fixes

We should create a milestone for CFGGuide. Here are some necessary performance and correctness improvements.

Ensure parser allows ambiguous terminals

The grammar ?start: /ab*/ /bc?/ with a generation of abbbb doesn't allow a next token of c; it requires bc.
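
A repro sketch against lark itself; using Earley with the dynamic_complete lexer (which tries all match lengths) is my assumption about how to exhibit the ambiguity:

from lark import Lark

grammar = r"?start: /ab*/ /bc?/"
parser = Lark(grammar, parser="earley", lexer="dynamic_complete")
parser.parse("abbbbc")  # succeeds: lexes as "abbb" + "bc"
# The guide's greedy lexer commits the whole prefix "abbbb" to /ab*/,
# so it then rejects "c" and only allows "bc".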

Allow skipping rule

The grammar ?start: "a" ("b" | "c")* "d" with a generation of a doesn't allow a next token of d.
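
The expected behavior, checked against lark directly:

from lark import Lark

grammar = r'?start: "a" ("b" | "c")* "d"'
parser = Lark(grammar, parser="lalr")
parser.parse("ad")  # succeeds: the starred group may match zero times
# Yet after generating "a", the guide does not offer "d" as a next token.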

(Both cases above are instances of generation being incorrectly over-constrained.)

Improve performance

  • Add benchmarks for the first 10 tokens generated, and for the last 10 of 100.
  • Improve performance generally (see profile in "Benchmarks" section of this issue)

New Issues: Other Enabled Improvements

Clean Up Dead Code

Remove

  • StopAtEosFSM
  • RegexFSM
  • Consider whether StopAtEOSGuide is useful anywhere

Ensure token decode correctness in RegexGuide as well

https://github.com/outlines-dev/outlines/pull/1067/files#r1693983229

Already Existing Issues

This PR enables the completion of the following issues:

Allow CFG in outlines.serve

#780

We currently only allow json and regex in serve: https://github.com/outlines-dev/outlines/blob/5d97ee1/outlines/serve/serve.py#L69-L70

Introduce SQL and Lark Grammars

SQL and Lark grammars and tests are already implemented in #587

Context-sensitive features such as Python's tree parser

Currently Python's TreeIndenter isn't supported: #592

Fix models.llamacpp tokenizer

The llamacpp tokenizer currently has a different interface from all the other tokenizers, which prevents CFG from working properly: #936

Remove Guide.is_final_state

#885

is_final_state is ambiguous; we should remove it in a separate PR.

@lapp0 marked this pull request as ready for review July 25, 2024 21:12
@@ -614,6 +652,8 @@ def __init__(self, conf: "LexerConf", states, always_accept=()):
lexer_conf.terminals = [
terminals_by_name[n] for n in accepts if n in terminals_by_name
]
if not lexer_conf.terminals:
continue
@lapp0 (Collaborator, Author) commented:

Note: bug fix for the case where lexer_conf.terminals is empty (which happens when EOS is the only legal next terminal).

token_history=lexer_state.last_token and [lexer_state.last_token],
state=parser_state,
terminals_by_name=self.root_lexer.terminals,
)
@lapp0 (Collaborator, Author) commented:

Note: Fixes the following tests in test_cfg_guide.py::test_cfg_next_token

  • Multiple Valid Continuations
  • Token is Substring of Another Token
  • Recursive Patterns

@brandonwillard added the enhancement, structured generation, correctness, and grammar labels on Jul 25, 2024
@brandonwillard (Member) commented:

We need a more general and persistent warning that explains that the CFG implementation is (and has always been) experimental community-contributed code. We may even need to clarify that it does not reflect the approach described in our technical report—aside from its use of incremental/partial parsing.

@lapp0 (Collaborator, Author) commented Jul 26, 2024

Thanks, I'll update docs/reference/cfg.md with a new section, and have the warning link there.

@lapp0 force-pushed the cfg-beta branch 4 times, most recently from 19129e5 to 23a310e, July 26, 2024 20:00
@lapp0 marked this pull request as draft July 27, 2024 01:09
@lapp0 force-pushed the cfg-beta branch 3 times, most recently from a5f529d to 823559a, July 27, 2024 07:36
@lapp0 marked this pull request as ready for review July 27, 2024 07:43
@lapp0 force-pushed the cfg-beta branch 2 times, most recently from db36739 to 5922b3b, July 27, 2024 16:17
@lapp0 (Collaborator, Author) commented Jul 27, 2024

I have added rejection sampling.

It checks each token for acceptance, starting with the highest logprob, and completes once one sample is accepted. This is effectively greedy sampling. The behavior is documented in docs/reference/cfg.md.

It is used by default in outlines.processors / outlines.generate.cfg.
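
A minimal sketch of the loop (token_is_accepted is a hypothetical stand-in for the guide's acceptance check):

import torch

def rejection_sample(logits, guide, state):
    # walk candidates from highest to lowest logprob; the first accepted
    # token is returned, which makes this effectively greedy sampling
    for token_id in torch.argsort(logits, descending=True).tolist():
        if token_is_accepted(guide, state, token_id):  # hypothetical check
            return token_id
    raise RuntimeError("no vocabulary token is valid under the grammar")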

Benchmarks

bench_cfg_guide.CFGGuideBenchmark.time_cfg_guide_run

   grammar       time
   ------------  ------------
   json          227±100ms
   arithmetic    1.54±0.04s

bench_cfg_guide.CFGGuideBenchmark.time_cfg_guide_run_rejection_sampling

   grammar       time
   ------------  ------------
   json          44.3±20ms
   arithmetic    75.4±6ms

Benchmarks aren't a strong indicator though; rejection sampling's performance improvement depends entirely on the fraction of tokens that are valid under the grammar's production rules at each state used in sampling.

Comment on lines +448 to +456
# normalize
if state.prev_token is None:
    new_token_str = self.tokenizer.decode([token_id])[0]
else:
    prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
    combined_token_str = self.tokenizer.decode([[state.prev_token, token_id]])[0]
    new_token_str = combined_token_str[len(prev_token_str) :]
@lapp0 (Collaborator, Author) commented Jul 27, 2024:

Note: this token normalization step, which determines a token's decoded value given the previous token, is necessary for the correctness of RegexFSM as well. It should be a separate issue.
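
In isolation, the normalization amounts to something like this standalone helper (a sketch assuming a HuggingFace-style decode; the names are mine):

def decoded_suffix(tokenizer, prev_token_id, token_id):
    """Return the text token_id contributes when it follows prev_token_id."""
    if prev_token_id is None:
        return tokenizer.decode([token_id])
    prev_str = tokenizer.decode([prev_token_id])
    combined_str = tokenizer.decode([prev_token_id, token_id])
    # decoding jointly lets the tokenizer resolve merges/whitespace correctly
    return combined_str[len(prev_str):]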

@miftahmoha (Contributor) commented Jul 27, 2024

As I signaled here on 28 April: #788 (comment).

I've been working on a parser recently which I would say is around 95% complete. It's built from scratch and dependency-free (excluding tests), with its own internals for guiding a CFG; it's designed solely for that purpose.

@rlouf @brandonwillard @lapp0 Should we have a look at it when it's finished? Should we discuss it in the Discord server?

The Discord link on GitHub is broken, BTW; it would be nice if someone could fix it.

@lapp0 (Collaborator, Author) commented Jul 28, 2024

> As I signaled here on 28 April: #788 (comment).
>
> I've been working on a parser recently which I would say is around 95% complete. It's built from scratch and dependency-free (excluding tests), with its own internals for guiding a CFG; it's designed solely for that purpose.
>
> @rlouf @brandonwillard @lapp0 Should we have a look at it when it's finished? Should we discuss it in the Discord server?
>
> The Discord link on GitHub is broken, BTW; it would be nice if someone could fix it.

Very interesting. Yes, we can discuss here or on Discord. Here's a temporary invite while we sort out the invite situation: https://discord.gg/H7pEAMPZ

Edit: I tried all three Discord links on GitHub and they all work for me. Which one is broken?

@miftahmoha (Contributor) commented:

@lapp0 It seems the problem was on my end; I was referring to the link in the "Join us" section, and it does work. Thanks.

@brandonwillard merged commit 72377db into dottxt-ai:main on Aug 31, 2024
5 of 7 checks passed