Use `token_ids` to track the FSM state for each sequence in the vLLM integration #539
Conversation
The title contains "draft" but this is not a draft PR. Does that mean it is ready for review?
I cannot fork your fork, therefore pasting the working fix to class `RegexLogitsProcessor`:

```python
import math
from typing import Dict, List, Tuple

import torch

# Import paths assumed from the surrounding outlines/serve/vllm.py module
from outlines.fsm.fsm import FSMState, RegexFSM


class RegexLogitsProcessor:
    def __init__(self, regex_string, llm):
        """Compile the FSM that drives the regex-guided generation.

        Parameters
        ----------
        regex_string
            A string that represents a regular expression
        llm
            An instance of `vllm.LLM`

        """
        tokenizer = self.adapt_tokenizer(llm.tokenizer)
        fsm = RegexFSM(regex_string, tokenizer)
        self.fsm = fsm
        self.fsm_state_cache: Dict[int, FSMState] = {}

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        """Use the FSM to bias the logits before sampling the next token."""
        state = self.get_fsm_state(tuple(input_ids))
        allowed_tokens = self.fsm.allowed_token_ids(state)

        mask = torch.full((scores.shape[-1],), -math.inf, device=scores.device)
        mask[allowed_tokens] = 0
        biased_scores = scores + mask

        return biased_scores

    def get_fsm_state(self, input_ids: Tuple[int, ...]) -> FSMState:
        if not input_ids:
            return FSMState(0)

        state_key = hash(input_ids)
        state = self.fsm_state_cache.get(state_key)
        if state is not None:
            return state

        # Recurse to fill in any previous FSM states missing from the cache
        prev_input_ids = input_ids[:-1]
        prev_state = self.get_fsm_state(prev_input_ids)
        last_token = input_ids[-1]
        state = self.fsm.next_state(prev_state, last_token)
        self.fsm_state_cache[state_key] = state
        return state

    def adapt_tokenizer(self, tokenizer):
        """Adapt vLLM's tokenizer so it can be used to compile the FSM.

        The API of Outlines tokenizers is slightly different to that of
        `transformers`. In addition we need to handle the missing spaces in
        Llama's tokenizer to be able to compile FSMs for this model.

        """
        tokenizer.vocabulary = tokenizer.get_vocab()
        tokenizer.special_tokens = set(tokenizer.all_special_tokens)

        def convert_token_to_string(token: str) -> str:
            from transformers.file_utils import SPIECE_UNDERLINE

            string = tokenizer.convert_tokens_to_string([token])

            # A hack to handle missing spaces in HF's Llama tokenizers
            if token.startswith(SPIECE_UNDERLINE) or token == "<0x20>":
                return " " + string

            return string

        tokenizer.convert_token_to_string = convert_token_to_string

        return tokenizer
```
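The masking step in `__call__` above can be sanity-checked with plain Python floats (a toy sketch with a 5-token vocabulary; the real code performs the same arithmetic on a `torch.Tensor`):

```python
import math

scores = [0.5, 1.0, -0.2, 2.0, 0.1]   # one logit per vocabulary token
allowed_tokens = [1, 3]                # token ids the FSM permits next

# Build the additive mask: 0 for allowed ids, -inf for everything else
mask = [0.0 if i in allowed_tokens else -math.inf for i in range(len(scores))]
biased_scores = [s + m for s, m in zip(scores, mask)]

# Disallowed tokens end up at -inf, so softmax assigns them zero probability
```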
I keep the working code in my
@lapp0 Could you please consider the above fix? It works for me. I don't want to hijack your PR, but we need to get this out of draft and reviewed. Thanks!
Today I will test and, given success, incorporate your changes.
We also need to get some upper limit on the number of recursions here:

```
prev_state = self.get_fsm_state(prev_input_ids)
```

The default recursion limit on Python 3.10 is 1000 and I don't see it increased in vLLM. If the maximum number of possible recursions in the above code is in the hundreds, then it may be better to rewrite the code to avoid using recursion. It is easy to do so by using a loop instead, it just ends up a bit less readable.

I've been running this with huge regex constraints and haven't seen any problems yet. But I haven't used a JSON schema constraint with it yet, aside from a single test case covering this constraint mode in my project.
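Sketching the loop-based rewrite suggested here (a minimal, self-contained version; `toy_next_state` is a stand-in for `fsm.next_state`, which is an assumption for illustration):

```python
from typing import Dict, Tuple


def toy_next_state(state: int, token_id: int) -> int:
    # Stand-in for fsm.next_state; any deterministic transition works here
    return state * 31 + token_id


def get_fsm_state_iterative(input_ids: Tuple[int, ...], cache: Dict[int, int]) -> int:
    """Loop-based equivalent of the recursive lookup.

    Walk back to the longest cached prefix, then replay forward, caching
    each intermediate state. Depth is bounded by the loops, so sequences
    longer than Python's default recursion limit (1000) cannot raise
    RecursionError.
    """
    n = len(input_ids)
    while n > 0 and hash(input_ids[:n]) not in cache:
        n -= 1
    state = cache[hash(input_ids[:n])] if n > 0 else 0  # 0 = initial FSM state
    for i in range(n, len(input_ids)):
        state = toy_next_state(state, input_ids[i])
        cache[hash(input_ids[: i + 1])] = state
    return state
```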
I cannot remove
Previous implementation assumed the logits processor always saw

I think we should
That's exactly why it crashed when I used your code here. It does not seem to be the case all the time for some reason, but I don't know why.
I will test your latest version here and see whether it works. I guess it will
If it fails, could you give an example of an API call which results in failure, so I can debug?
Yeah, exactly. Got a `KeyError`:

```
...
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 172, in _apply_logits_processors
    logits_row = logits_processor(token_ids, logits_row)
  File "/home/viktor/dep/outlines-contrib/outlines/serve/vllm.py", line 61, in __call__
    state = self.get_fsm_state(input_ids)
  File "/home/viktor/dep/outlines-contrib/outlines/serve/vllm.py", line 78, in get_fsm_state
    prev_state = self.fsm_state_cache[prev_state_key]
KeyError: 5740354900026072187
...
```

This is exactly why I added that recursion: to fill up any previous tokens missing in the cache.

Request was:

```json
{
    "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n",
    "n": 1,
    "best_of": 1,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "min_p": 0.0,
    "use_beam_search": false,
    "length_penalty": 1.0,
    "early_stopping": false,
    "stop": [],
    "stop_token_ids": [],
    "include_stop_str_in_output": false,
    "ignore_eos": false,
    "max_tokens": 50,
    "logprobs": null,
    "prompt_logprobs": null,
    "skip_special_tokens": true,
    "spaces_between_special_tokens": true,
    "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}
```

Model:

GPUs: 2x4090 (2x24GB)

vLLM command:

```shell
python -O -u -m outlines.serve.serve \
    --model=TheBloke/deepseek-coder-33B-instruct-AWQ \
    --quantization=awq \
    --dtype=float16 \
    --host=0.0.0.0 \
    --port=8000 \
    --max-model-len=16384 \
    --max-num-seqs=16 \
    --tensor-parallel-size=2 \
    --swap-space=8 \
    --gpu-memory-utilization=0.95 \
    --enforce-eager \
    --disable-log-requests
```
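For reference, the `regex` constraint in this request accepts a comma-separated list of integers with optional whitespace; a quick check of what it does and doesn't match:

```python
import re

pattern = r"\d+(\s*,\s*\d+)*\s*"

# A comma-separated list of primes with a trailing newline satisfies the constraint
assert re.fullmatch(pattern, "2, 3, 5, 7, 11, 13, 17, 19, 23, 29\n") is not None

# A non-numeric item does not
assert re.fullmatch(pattern, "2, 3, five") is None
```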
@viktor-ferenczi thank you, will investigate.
@lapp0 Caught the

So it fails on the very first token, when there have been no previous tokens. The failing line is:

```
prev_state = self.fsm_state_cache[prev_state_key]
```

Certainly this item is not present in the cache. It means that the body of this if condition is never executed:

```
if not input_ids:
    self.fsm_state_cache[state_key] = FSMState(0)
```

I think my fix was correct, unless it can be explained why the above happens.
Makes sense, the only variation from the previous implementation was that a defaultdict is no longer used. Your recursive solution alleviated this, but my removal of the recursion resulted in the error. For the mentioned reason I still don't think we should recurse.

Before I push a new test case and specific handling for empty previous token IDs, could you confirm that you never received the log line on a separate ray worker? I want to eliminate the possibility that ray workers duplicated the logits processors, resulting in two separate state caches.

I'm not even sure how you have a vLLM world size of 2, as I ran into issues with this until I made the logits processor a separate
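The defaultdict difference can be reduced to a toy sketch (illustrative only, not the actual outlines code):

```python
from collections import defaultdict

# Old behaviour: an unseen prefix silently maps to the initial state
old_cache = defaultdict(int)       # int() == 0, the initial FSM state
first_state = old_cache[hash(())]  # no error, returns 0

# New behaviour: a plain dict raises KeyError for the empty prefix
new_cache = {}
try:
    state = new_cache[hash(())]
except KeyError:
    state = 0  # the empty-prefix case must now be handled explicitly
```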
Fortunately I kept all the logs from my tests today and can confirm that they don't contain any lines with

I had crashes before with

What I see is that for some reason the cache is not initialized here with the empty

Are we still sticking to vLLM

I'm using that version, because it was mentioned on the doc page that
You should be able to use vLLM
@viktor-ferenczi I wasn't able to reproduce for

For

Some observations:

But it seems this is an issue relating to tensor parallel. Using the recursive solution appears to guarantee that the entire sequence is reprocessed for each token.

Should be resolved in #524
tests/serve/test_vllm.py (outdated):

```python
@pytest.mark.parametrize("forget_logits_processor", [True, False])
def test_time_regexp(forget_logits_processor):
```
What is this testing exactly?
In vLLM 0.2.6, logits processors would be applied on a "random" ray worker. This caused a regression because a new logits processor had to be created for each worker, causing the loss of previous state.

This test simulates that scenario by creating a new logits processor for every token processed.

However I will have to experiment to determine whether this test and the changes to make it pass are still necessary, as vLLM now applies logits processors on the same worker each time as of vLLM 0.2.7, per #539 (comment)
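What "a new logits processor for every token" costs can be shown with a minimal stand-in class (hypothetical, simplified; not the actual vLLM or outlines code):

```python
class StatefulProcessor:
    """Keeps FSM-like state on the instance; recreating it loses the state."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, token_id: int) -> int:
        self.state = self.state * 31 + token_id
        return self.state


# A worker that recreates the processor for every token never accumulates state
per_token = [StatefulProcessor().step(t) for t in (5, 7, 9)]

# Reusing one instance tracks the whole sequence
proc = StatefulProcessor()
cumulative = [proc.step(t) for t in (5, 7, 9)]
```

Caching state keyed on the sequence's token IDs sidesteps this, because a fresh processor can rebuild the state from the token history alone.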
```python
assert re.fullmatch(pattern, llm.tokenizer.decode(token_ids)) is not None

def test_time_regexp_multiple_samples():
```
What is this testing?
I observed a lack of stability in sequence order when using beam search with Outlines. This resulted in a new token for one sequence being applied to a different sequence.
This test reproduces that behavior. It fails on `main` and passes with these changes.
I will leave an explanatory doc string.
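Keying cached state on the token IDs themselves means state follows each beam regardless of the order in which sequences are processed; a toy sketch (the transition function is a stand-in, not the real FSM):

```python
from typing import Dict, Sequence


def fsm_state(input_ids: Sequence[int], cache: Dict[int, int]) -> int:
    # State derived purely from the token history, never from sequence order
    key = hash(tuple(input_ids))
    if key not in cache:
        state = 0
        for t in input_ids:
            state = state * 31 + t  # stand-in transition
        cache[key] = state
    return cache[key]


cache: Dict[int, int] = {}
beam_a, beam_b = [1, 2, 3], [1, 2, 4]

# Interleaving lookups across beams cannot mix their states
sa = fsm_state(beam_a, cache)
sb = fsm_state(beam_b, cache)
```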
pyproject.toml (outdated):

```diff
@@ -60,7 +60,7 @@ test = [
     "huggingface_hub"
 ]
 serve = [
```
Does it work with `vllm==0.3.0`?
It reproduces the case where state 5 is missing from the generated `fsm.states_to_token_maps`.
This test case fails now, which is expected until the fix is applied.
This reverts commit f6e6743.
It appears that with

@viktor-ferenczi I'm going to close this PR. Thanks so much for your help on it, but it seems vllm resolved the issue upstream. Let me know if you think there's any work that is remaining (aside from your fix in #606).

Smoke Tests

Tests run on 2x A100 SXM4 with the following installed:

Smoke test summary:

Smoke test tensor parallel (
I don't see any issue linked to this PR, but maybe it closes a few?
For `outlines/vllm`, FSM-sequence correspondence was previously broken, resulting in FSM state being mixed between sequences, corrupting output. To alleviate this, we have `_patched_apply_logits_processor`, which passes a stable sequence ID to the logits processor. In this PR we eliminate `_patched_apply_logits_processor` and cache FSM state based on the sequence's input IDs.

Continuation of #539, but much simpler because the vllm upgrade fixed a lot of the issues being addressed there.

Related discussions:

- #624

Fixes:

- Fixes #605
- Fixes #610

Already fixed:

- #524 (this one can be closed, as it was addressed previously by upgrading vllm)

@viktor-ferenczi can you please confirm whether this branch fixes either #610 or #605

# Smoke tests

### basic parallel

passed

<details>

```python
import json
import vllm
from pydantic import BaseModel
from typing import List
import torch
import pandas as pd
from outlines.serve.vllm import JSONLogitsProcessor


class ConceptsList(BaseModel):
    concepts: List[str]


BASE_MODEL = "microsoft/phi-2"
llm = vllm.LLM(model=BASE_MODEL, tensor_parallel_size=1, dtype=torch.float16, max_model_len=2048)

logits_processor = JSONLogitsProcessor(ConceptsList, llm.llm_engine)

full_prompts = [
    f"Provide me a list of {i} strings with key 'concepts'"
    for i in range(20)
]

batch_results = llm.generate(
    full_prompts,
    sampling_params=vllm.SamplingParams(
        max_tokens=2048, logits_processors=[logits_processor]
    ),
)

for result in batch_results:
    for output in result.outputs:
        json.loads(output.text)
```

</details>

### never ending regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?)*",
        "n": 7
}'

{"text":["Sequence of numbers and letters:1-a-1-b-1-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-d-1-","Sequence of numbers and letters:2-a-1-b-2-c-1-b-","Sequence of numbers and letters:2-b-3-c-d-2-b-3-","Sequence of numbers and letters:2-a-3-b-2-b-1-c-","Sequence of numbers and letters:2-a-3-b-d-2-a-3-"]}

# rules for the above to validate correct FSM-sequence correspondence:
# [123] always followed by [abc], [def] only ever preceded by [abc]

# 1-a-1-b-1-c-1-a-
# 1-a-2-b-3-c-1-a-
# 1-a-2-b-3-c-d-1-
# 2-a-1-b-2-c-1-b-
# 2-b-3-c-d-2-b-3-
# 2-a-3-b-2-b-1-c-
# 2-a-3-b-d-2-a-3-
```

</details>

### sometimes ending early regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?){3}",
        "n": 16
}'
```

output

```
{"text":["Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:3-a-1-b-2-c-d-","Sequence of numbers and letters:2-a-1-b-3-c-d-","Sequence of numbers and letters:1-a-1-b-1-c-d-","Sequence of numbers and letters:2-a-3-b-d-1-c-e-","Sequence of numbers and letters:1-b-3-a-2-c-d-","Sequence of numbers and letters:3-a-d-1-b-e-2-c-","Sequence of numbers and letters:1-a-3-b-1-b-d-","Sequence of numbers and letters:3-a-f-2-b-d-1-c-","Sequence of numbers and letters:1-b-d-3-a-e-2-c-","Sequence of numbers and letters:3-c-1-b-d-1-a-e-","Sequence of numbers and letters:1-c-1-c-e-1-b-e-"]}
```

analysis:

```
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
3-a-1-b-2-c-d-
2-a-1-b-3-c-d-
1-a-1-b-1-c-d-
2-a-3-b-d-1-c-e-
1-b-3-a-2-c-d-
3-a-d-1-b-e-2-c-
1-a-3-b-1-b-d-
3-a-f-2-b-d-1-c-
1-b-d-3-a-e-2-c-
3-c-1-b-d-1-a-e-
1-c-1-c-e-1-b-e-
```

Observations:

- All patterns are correct
- Patterns don't "borrow" FSM state from one another, they retain their own independent state
- Some patterns produced more tokens than others successfully

</details>

### Viktor's regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n",
        "n": 1,
        "best_of": 1,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "repetition_penalty": 1.0,
        "temperature": 0.0,
        "top_p": 1.0,
        "top_k": -1,
        "min_p": 0.0,
        "use_beam_search": false,
        "length_penalty": 1.0,
        "early_stopping": false,
        "stop": [],
        "stop_token_ids": [],
        "include_stop_str_in_output": false,
        "ignore_eos": false,
        "max_tokens": 50,
        "logprobs": null,
        "prompt_logprobs": null,
        "skip_special_tokens": true,
        "spaces_between_special_tokens": true,
        "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}'
```

output:

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n2, 3, 5, 7, 11, 13, 17, 19, 23, 29\n"]}
```

</details>

### Viktor's schema

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n",
        "n": 5,
        "best_of": 5,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "repetition_penalty": 1.0,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": -1,
        "min_p": 0.0,
        "use_beam_search": false,
        "length_penalty": 1.0,
        "early_stopping": false,
        "stop": [],
        "stop_token_ids": [],
        "include_stop_str_in_output": false,
        "ignore_eos": false,
        "max_tokens": 200,
        "logprobs": null,
        "prompt_logprobs": null,
        "skip_special_tokens": true,
        "spaces_between_special_tokens": true,
        "schema": {
            "properties": {
                "kind": { "title": "Kind", "type": "string" },
                "color": { "title": "Color", "type": "string" },
                "count": { "title": "Count", "type": "integer" },
                "weight": { "title": "Weight", "type": "number" },
                "sweet": { "title": "Sweet", "type": "boolean" }
            },
            "required": [ "kind", "color", "count", "weight", "sweet" ],
            "title": "Fruit",
            "type": "object"
        }
}'
```

output (five completions, each echoing the prompt followed by a schema-conforming fruit JSON):

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.2,\n\"sweet\": true\n}", ...]}
```

The remaining four completions differed only in the generated fruit JSON:

```
{\n \"kind\": \"Apple\",\n \"color\": \"Red\",\n \"count\": 10,\n \"weight\": 0.2,\n \"sweet\": true\n}
{\n \"kind\": \"apple\",\n \"color\": \"red\",\n \"count\": 5,\n \"weight\": 0.1,\n \"sweet\": true\n}
{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.24,\n\"sweet\": true\n}
{\n \"kind\": \"Apple\",\n \"color\": \"red\",\n \"count\": 5,\n \"weight\": 0.3,\n \"sweet\": true\n}
```

</details>

---------

Co-authored-by: Andrew Lapp <[email protected]>
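The correspondence rules stated in the never-ending-regex smoke test ([123] always followed by [abc], [def] only ever preceded by [abc]) can be checked mechanically; a small helper (an editor's sketch, not part of the PR):

```python
def valid_correspondence(sample: str) -> bool:
    """Check adjacency rules for strings like '1-a-2-b-3-c-d-'."""
    parts = sample.strip("-").split("-")
    for prev, cur in zip(parts, parts[1:]):
        if prev in "123" and cur not in "abc":
            return False  # a digit must be followed by a/b/c
        if cur in "def" and prev not in "abc":
            return False  # d/e/f may only follow a/b/c
    return True
```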
Fixes #524

TODO:

- `seq_id` argument

Separate PR:

- remove `_patched_apply_logits_processors` entirely, it doesn't do anything anymore.

Related discussions: