Hi! I use Phi-3.5 for structured generation, constraining the output with a Lark grammar. Unfortunately it gets extremely slow when I set a higher `max_tokens` value. Code for context:

```python
my_grammar = """
// Some subset of HTML, so it allows for:
?start: (tag1|tag2)*
?tag1: "<tag>" WORD* "</tag>"
// etc...
%import common.WORD
"""
```
```python
import outlines
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
outlines_model = outlines.models.transformers_vision(
    model_id,
    model_class=AutoModelForCausalLM,
    model_kwargs={"device_map": "auto", "trust_remote_code": True, "torch_dtype": "auto"},
    device="cuda",
    processor_class=AutoProcessor,
    processor_kwargs={"trust_remote_code": True, "num_crops": 16},
)
description_generator = outlines.generate.cfg(outlines_model, my_grammar)
output = description_generator(
    "<|image_1|>\nAnalyze image", [input_sample.pil_image], max_tokens=1024
)
```

Some numbers:
(For context, the average generation time without Outlines is ~30 seconds, so it is faster for shorter outputs, but of course I want to support longer outputs, so it will slow down my generation.) The generation time keeps increasing even though in this example only around 50 tokens are ever generated. I guess this is because the FSM has to be compiled for the full 2500 steps? Is there a way to precompute and cache the FSM, or to compute it incrementally during generation, say +50 steps at a time?
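The incremental idea can be sketched in plain Python. This is a toy, not the Outlines internals; the NFA encoding and the `LazyDFA` class are invented for illustration. The point is that automaton transitions can be computed on demand and memoized, so only the states actually reached during generation are ever built, rather than compiling the full machine up front.

```python
class LazyDFA:
    """Subset-construction DFA whose transitions are built on demand."""

    def __init__(self, nfa, start):
        self.nfa = nfa              # dict: (nfa_state, symbol) -> set of nfa_states
        self.start = frozenset(start)
        self._trans = {}            # memo: (dfa_state, symbol) -> dfa_state

    def step(self, dfa_state, symbol):
        key = (dfa_state, symbol)
        if key not in self._trans:  # computed only the first time it is needed
            nxt = set()
            for s in dfa_state:
                nxt |= self.nfa.get((s, symbol), set())
            self._trans[key] = frozenset(nxt)
        return self._trans[key]

# Toy NFA for the regex (ab)*: 0 --a--> 1, 1 --b--> 0; state 0 is accepting.
nfa = {(0, "a"): {1}, (1, "b"): {0}}
dfa = LazyDFA(nfa, start={0})

state = dfa.start
for ch in "abab":
    state = dfa.step(state, ch)
print(0 in state)        # True: "abab" matches (ab)*
print(len(dfa._trans))   # 2: only the transitions actually taken were built
```

Something in this spirit would amortize the compilation cost over the steps that actually run, instead of paying for `max_tokens` steps whether they are used or not.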
Replies: 1 comment
Hi @plutasnyy

Outlines CFG is in beta and has some performance and correctness bugs.

Terminals are all converted to FSMs before generation, per https://github.com/dottxt-ai/outlines/blob/main/outlines/fsm/parsing.py#L552C9-L558.

Per the profiling in the CFG beta PR ("Benchmarks" section), the slowness is due to using the partial parser to check each token's legality.
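For contrast, here is a toy sketch of why pure regex/FSM-based constraints stay fast. The vocabulary, `dfa_step`, and index layout below are invented for illustration and are not the Outlines code: the idea is that the set of legal tokens per automaton state is computed once, so each decoding step is a dictionary lookup instead of a parser call per candidate token.

```python
# Hand-written DFA for the terminal WORD = [a-z]+ :
# state 0 = start, state 1 = inside a word (accepting).
def dfa_step(state, ch):
    return 1 if ch.isalpha() and ch.islower() else None  # None = dead state

def walk(state, token):
    """Run the DFA over a whole vocabulary token; None if it becomes illegal."""
    for ch in token:
        state = dfa_step(state, ch)
        if state is None:
            return None
    return state

vocab = ["hello", "Foo", "world", "42", "ab"]

# Build the index ONCE: state -> {token_id: next_state}.
index = {}
for state in (0, 1):
    index[state] = {
        i: walk(state, tok)
        for i, tok in enumerate(vocab)
        if walk(state, tok) is not None
    }

print(sorted(index[0]))  # [0, 2, 4] -> "hello", "world", "ab" are legal
```

With a CFG, the set of legal continuations depends on the parser stack rather than a single FSM state, which is why the current beta falls back to checking candidates with the partial parser at every step.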