Hi! I use Phi-3.5 for structured generation, constraining the output with a Lark grammar. Unfortunately it gets extremely slow when I set a higher `max_tokens` value. Code for context:

```python
my_grammar = """
// Some subset of HTML, so it allows for:
?start: (tag1|tag2)*
?tag1: "<tag>" WORD* "</tag>"
// etc...
%import common.WORD
"""
```
```python
import outlines
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
outlines_model = outlines.models.transformers_vision(
    model_id,
    model_class=AutoModelForCausalLM,
    model_kwargs={"device_map": "auto", "trust_remote_code": True, "torch_dtype": "auto"},
    device="cuda",
    processor_class=AutoProcessor,
    processor_kwargs={"trust_remote_code": True, "num_crops": 16},
)
description_generator = outlines.generate.cfg(outlines_model, my_grammar)
output = description_generator(
    "<|image_1|>\nAnalyze image", [input_sample.pil_image], max_tokens=1024
)
```

Some numbers:
(For context, the average generation time without Outlines is ~30 seconds, so it is faster for shorter outputs, but of course I want to support longer outputs, so it will slow down my generation.) The generation time keeps increasing even though in this example only around 50 tokens are ever generated. I guess this is because the FSM has to be compiled for the full 2500 steps? Is there a way to precompute and cache the FSM, or to compute it incrementally during generation, say +50 steps at a time?
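The incremental idea can be sketched in plain Python. This is a toy, not the Outlines internals; the NFA encoding and the `LazyDFA` class are invented for illustration. The point is that automaton transitions can be computed on demand and memoized, so only the states actually reached during generation are ever built, rather than compiling the full machine up front.

```python
class LazyDFA:
    """Subset-construction DFA whose transitions are built on demand."""

    def __init__(self, nfa, start):
        self.nfa = nfa              # dict: (nfa_state, symbol) -> set of nfa_states
        self.start = frozenset(start)
        self._trans = {}            # memo: (dfa_state, symbol) -> dfa_state

    def step(self, dfa_state, symbol):
        key = (dfa_state, symbol)
        if key not in self._trans:  # computed only the first time it is needed
            nxt = set()
            for s in dfa_state:
                nxt |= self.nfa.get((s, symbol), set())
            self._trans[key] = frozenset(nxt)
        return self._trans[key]

# Toy NFA for the regex (ab)*: 0 --a--> 1, 1 --b--> 0; state 0 is accepting.
nfa = {(0, "a"): {1}, (1, "b"): {0}}
dfa = LazyDFA(nfa, start={0})

state = dfa.start
for ch in "abab":
    state = dfa.step(state, ch)
print(0 in state)        # True: "abab" matches (ab)*
print(len(dfa._trans))   # 2: only the transitions actually taken were built
```

Something in this spirit would amortize the compilation cost over the steps that actually run, instead of paying for `max_tokens` steps whether they are used or not.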
Replies: 1 comment
Hi @plutasnyy

Outlines CFG is in beta and has some performance and correctness bugs.

Terminals are all converted to FSMs before generation, per https://github.com/dottxt-ai/outlines/blob/main/outlines/fsm/parsing.py#L552C9-L558.

Per the profiling in the CFG beta PR ("Benchmarks" section), the slowness is due to using the partial parser to check each token's legality.
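For contrast, here is a toy sketch of why pure regex/FSM-based constraints stay fast. The vocabulary, `dfa_step`, and index layout below are invented for illustration and are not the Outlines code: the idea is that the set of legal tokens per automaton state is computed once, so each decoding step is a dictionary lookup instead of a parser call per candidate token.

```python
# Hand-written DFA for the terminal WORD = [a-z]+ :
# state 0 = start, state 1 = inside a word (accepting).
def dfa_step(state, ch):
    return 1 if ch.isalpha() and ch.islower() else None  # None = dead state

def walk(state, token):
    """Run the DFA over a whole vocabulary token; None if it becomes illegal."""
    for ch in token:
        state = dfa_step(state, ch)
        if state is None:
            return None
    return state

vocab = ["hello", "Foo", "world", "42", "ab"]

# Build the index ONCE: state -> {token_id: next_state}.
index = {}
for state in (0, 1):
    index[state] = {
        i: walk(state, tok)
        for i, tok in enumerate(vocab)
        if walk(state, tok) is not None
    }

print(sorted(index[0]))  # [0, 2, 4] -> "hello", "world", "ab" are legal
```

With a CFG, the set of legal continuations depends on the parser stack rather than a single FSM state, which is why the current beta falls back to checking candidates with the partial parser at every step.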