Use `token_ids` to track the FSM state for each sequence in the vLLM integration #539
Conversation
The title contains "draft" but this is not a draft PR. Does that mean it is ready for review?
I cannot fork your fork, therefore pasting the working fix to class `RegexLogitsProcessor`:

```python
import math
from typing import Dict, List, Tuple

import torch

# Import paths assumed from the surrounding outlines/serve/vllm.py module
from outlines.fsm.fsm import FSMState, RegexFSM


class RegexLogitsProcessor:
    def __init__(self, regex_string, llm):
        """Compile the FSM that drives the regex-guided generation.

        Parameters
        ----------
        regex_string
            A string that represents a regular expression
        llm
            An instance of `vllm.LLM`

        """
        tokenizer = self.adapt_tokenizer(llm.tokenizer)
        fsm = RegexFSM(regex_string, tokenizer)
        self.fsm = fsm
        self.fsm_state_cache: Dict[int, FSMState] = {}

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        """Use the FSM to bias the logits before sampling the next token."""
        state = self.get_fsm_state(tuple(input_ids))
        allowed_tokens = self.fsm.allowed_token_ids(state)

        mask = torch.full((scores.shape[-1],), -math.inf, device=scores.device)
        mask[allowed_tokens] = 0
        biased_scores = scores + mask

        return biased_scores

    def get_fsm_state(self, input_ids: Tuple[int, ...]) -> FSMState:
        if not input_ids:
            return FSMState(0)

        state_key = hash(input_ids)
        state = self.fsm_state_cache.get(state_key)
        if state is not None:
            return state

        # Recurse to fill in any previous FSM states missing from the cache
        prev_input_ids = input_ids[:-1]
        prev_state = self.get_fsm_state(prev_input_ids)
        last_token = input_ids[-1]
        state = self.fsm.next_state(prev_state, last_token)
        self.fsm_state_cache[state_key] = state
        return state

    def adapt_tokenizer(self, tokenizer):
        """Adapt vLLM's tokenizer so it can be used to compile the FSM.

        The API of Outlines tokenizers is slightly different to that of
        `transformers`. In addition we need to handle the missing spaces in
        Llama's tokenizer to be able to compile FSMs for this model.

        """
        tokenizer.vocabulary = tokenizer.get_vocab()
        tokenizer.special_tokens = set(tokenizer.all_special_tokens)

        def convert_token_to_string(token: str) -> str:
            from transformers.file_utils import SPIECE_UNDERLINE

            string = tokenizer.convert_tokens_to_string([token])

            # A hack to handle missing spaces in HF's Llama tokenizers
            if token.startswith(SPIECE_UNDERLINE) or token == "<0x20>":
                return " " + string

            return string

        tokenizer.convert_token_to_string = convert_token_to_string

        return tokenizer
```
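The masking step in `__call__` above can be sanity-checked with plain Python floats (a toy sketch with a 5-token vocabulary; the real code performs the same arithmetic on a `torch.Tensor`):

```python
import math

scores = [0.5, 1.0, -0.2, 2.0, 0.1]   # one logit per vocabulary token
allowed_tokens = [1, 3]                # token ids the FSM permits next

# Build the additive mask: 0 for allowed ids, -inf for everything else
mask = [0.0 if i in allowed_tokens else -math.inf for i in range(len(scores))]
biased_scores = [s + m for s, m in zip(scores, mask)]

# Disallowed tokens end up at -inf, so softmax assigns them zero probability
```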
I keep the working code in my
@lapp0 Could you please consider the above fix? It works for me. I don't want to hijack your PR, but we need to get this out of draft and reviewed. Thanks!
Today I will test and, given success, incorporate your changes.
We also need to get some upper limit on the number of recursions here:

```
prev_state = self.get_fsm_state(prev_input_ids)
```

The default recursion limit on Python 3.10 is 1000 and I don't see it increased in vLLM. If the maximum number of possible recursions in the above code is in the hundreds, then it may be better to rewrite the code to avoid using recursion. It is easy to do so by using a loop instead, it just ends up a bit less readable.

I've been running this with huge regex constraints and haven't seen any problems yet. But I haven't used a JSON schema constraint with it yet, aside from a single test case covering this constraint mode in my project.
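Sketching the loop-based rewrite suggested here (a minimal, self-contained version; `toy_next_state` is a stand-in for `fsm.next_state`, which is an assumption for illustration):

```python
from typing import Dict, Tuple


def toy_next_state(state: int, token_id: int) -> int:
    # Stand-in for fsm.next_state; any deterministic transition works here
    return state * 31 + token_id


def get_fsm_state_iterative(input_ids: Tuple[int, ...], cache: Dict[int, int]) -> int:
    """Loop-based equivalent of the recursive lookup.

    Walk back to the longest cached prefix, then replay forward, caching
    each intermediate state. Depth is bounded by the loops, so sequences
    longer than Python's default recursion limit (1000) cannot raise
    RecursionError.
    """
    n = len(input_ids)
    while n > 0 and hash(input_ids[:n]) not in cache:
        n -= 1
    state = cache[hash(input_ids[:n])] if n > 0 else 0  # 0 = initial FSM state
    for i in range(n, len(input_ids)):
        state = toy_next_state(state, input_ids[i])
        cache[hash(input_ids[: i + 1])] = state
    return state
```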
I cannot remove
Previous implementation assumed the logits processor always saw

I think we should
That's exactly why it crashed when I used your code here. It does not seem to be the case all the time for some reason, but I don't know why.
I will test your latest version here and see whether it works. I guess it will
If it fails, could you give an example of an API call which results in failure, so I can debug?
Yeah, exactly. Got a `KeyError`:

```
...
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 172, in _apply_logits_processors
    logits_row = logits_processor(token_ids, logits_row)
  File "/home/viktor/dep/outlines-contrib/outlines/serve/vllm.py", line 61, in __call__
    state = self.get_fsm_state(input_ids)
  File "/home/viktor/dep/outlines-contrib/outlines/serve/vllm.py", line 78, in get_fsm_state
    prev_state = self.fsm_state_cache[prev_state_key]
KeyError: 5740354900026072187
...
```

This is exactly why I added that recursion: to fill up any previous tokens missing in the cache.

Request was:

```json
{
    "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n",
    "n": 1,
    "best_of": 1,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "min_p": 0.0,
    "use_beam_search": false,
    "length_penalty": 1.0,
    "early_stopping": false,
    "stop": [],
    "stop_token_ids": [],
    "include_stop_str_in_output": false,
    "ignore_eos": false,
    "max_tokens": 50,
    "logprobs": null,
    "prompt_logprobs": null,
    "skip_special_tokens": true,
    "spaces_between_special_tokens": true,
    "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}
```

Model:

GPUs: 2x4090 (2x24GB)

vLLM command:

```shell
python -O -u -m outlines.serve.serve \
    --model=TheBloke/deepseek-coder-33B-instruct-AWQ \
    --quantization=awq \
    --dtype=float16 \
    --host=0.0.0.0 \
    --port=8000 \
    --max-model-len=16384 \
    --max-num-seqs=16 \
    --tensor-parallel-size=2 \
    --swap-space=8 \
    --gpu-memory-utilization=0.95 \
    --enforce-eager \
    --disable-log-requests
```
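For reference, the `regex` constraint in this request accepts a comma-separated list of integers with optional whitespace; a quick check of what it does and doesn't match:

```python
import re

pattern = r"\d+(\s*,\s*\d+)*\s*"

# A comma-separated list of primes with a trailing newline satisfies the constraint
assert re.fullmatch(pattern, "2, 3, 5, 7, 11, 13, 17, 19, 23, 29\n") is not None

# A non-numeric item does not
assert re.fullmatch(pattern, "2, 3, five") is None
```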
@viktor-ferenczi thank you, will investigate.
@lapp0 Caught the

So it fails on the very first token, when there have been no previous tokens. The failing line is:

```
prev_state = self.fsm_state_cache[prev_state_key]
```

Certainly this item is not present in the cache. It means that the body of this if condition is never executed:

```
if not input_ids:
    self.fsm_state_cache[state_key] = FSMState(0)
```

I think my fix was correct, unless it can be explained why the above happens.
Makes sense, the only variation from the previous implementation was that a defaultdict is no longer used. Your recursive solution alleviated this, but my removal of the recursion resulted in the error. For the mentioned reason I still don't think we should recurse.

Before I push a new test case and specific handling for empty previous token IDs, could you confirm that you never received the log line on a separate ray worker? I want to eliminate the possibility that ray workers duplicated the logits processors, resulting in two separate state caches.

I'm not even sure how you have a vLLM world size of 2, as I ran into issues with this until I made the logits processor a separate
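The defaultdict difference can be reduced to a toy sketch (illustrative only, not the actual outlines code):

```python
from collections import defaultdict

# Old behaviour: an unseen prefix silently maps to the initial state
old_cache = defaultdict(int)       # int() == 0, the initial FSM state
first_state = old_cache[hash(())]  # no error, returns 0

# New behaviour: a plain dict raises KeyError for the empty prefix
new_cache = {}
try:
    state = new_cache[hash(())]
except KeyError:
    state = 0  # the empty-prefix case must now be handled explicitly
```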
Fortunately I kept all the logs from my tests today and can confirm that they don't contain any lines with

I had crashes before with

What I see is that for some reason the cache is not initialized here with the empty

Are we still sticking to vLLM

I'm using that version, because it was mentioned on the doc page that
You should be able to use vLLM
@viktor-ferenczi I wasn't able to reproduce for

For

Some observations:

But it seems this is an issue relating to tensor parallel. Using the recursive solution appears to guarantee that the entire sequence is reprocessed for each token.

Should be resolved in #524
tests/serve/test_vllm.py (outdated):

```python
@pytest.mark.parametrize("forget_logits_processor", [True, False])
def test_time_regexp(forget_logits_processor):
```
What is this testing exactly?
In vLLM 0.2.6, logits processors would be applied on a "random" ray worker. This caused a regression because a new logits processor had to be created for each worker, causing the loss of previous state.

This test simulates that scenario by creating a new logits processor for every token processed.

However I will have to experiment to determine whether this test and the changes to make it pass are still necessary, as vLLM now applies logits processors on the same worker each time as of vLLM 0.2.7, per #539 (comment)
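What "a new logits processor for every token" costs can be shown with a minimal stand-in class (hypothetical, simplified; not the actual vLLM or outlines code):

```python
class StatefulProcessor:
    """Keeps FSM-like state on the instance; recreating it loses the state."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, token_id: int) -> int:
        self.state = self.state * 31 + token_id
        return self.state


# A worker that recreates the processor for every token never accumulates state
per_token = [StatefulProcessor().step(t) for t in (5, 7, 9)]

# Reusing one instance tracks the whole sequence
proc = StatefulProcessor()
cumulative = [proc.step(t) for t in (5, 7, 9)]
```

Caching state keyed on the sequence's token IDs sidesteps this, because a fresh processor can rebuild the state from the token history alone.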
```python
assert re.fullmatch(pattern, llm.tokenizer.decode(token_ids)) is not None

def test_time_regexp_multiple_samples():
```
What is this testing?
I observed a lack of stability in sequence order when using beam search with Outlines. This resulted in a new token for one sequence being applied to a different sequence.
This test reproduces that behavior. It fails on `main` and passes with these changes.
I will leave an explanatory doc string.
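Keying cached state on the token IDs themselves means state follows each beam regardless of the order in which sequences are processed; a toy sketch (the transition function is a stand-in, not the real FSM):

```python
from typing import Dict, Sequence


def fsm_state(input_ids: Sequence[int], cache: Dict[int, int]) -> int:
    # State derived purely from the token history, never from sequence order
    key = hash(tuple(input_ids))
    if key not in cache:
        state = 0
        for t in input_ids:
            state = state * 31 + t  # stand-in transition
        cache[key] = state
    return cache[key]


cache: Dict[int, int] = {}
beam_a, beam_b = [1, 2, 3], [1, 2, 4]

# Interleaving lookups across beams cannot mix their states
sa = fsm_state(beam_a, cache)
sb = fsm_state(beam_b, cache)
```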
pyproject.toml (outdated):

```diff
@@ -60,7 +60,7 @@ test = [
     "huggingface_hub"
 ]
 serve = [
```
Does it work with `vllm==0.3.0`?
It reproduces the case where state 5 is missing from the generated `fsm.states_to_token_maps`.
This test case fails now, which is expected until the fix is applied.
This reverts commit f6e6743.
It appears that with

@viktor-ferenczi I'm going to close this PR. Thanks so much for your help on it, but it seems vllm resolved the issue upstream. Let me know if you think there's any work that is remaining (aside from your fix in #606).

Smoke Tests

Tests run on 2x A100 SXM4 with the following installed:

Smoke test summary:

Smoke test tensor parallel (
I don't see any issue linked to this PR, but maybe it closes a few?
For `outlines/vllm`, FSM-sequence correspondence was previously broken, resulting in FSM state being mixed between sequences, corrupting output. To alleviate this, we have `_patched_apply_logits_processor`, which passes a stable sequence ID to the logits processor. In this PR we eliminate `_patched_apply_logits_processor` and cache FSM state based on the sequence's input IDs.

Continuation of #539, but much simpler because the vllm upgrade fixed a lot of the issues being addressed there.

Related discussions:

- #624

Fixes:

- Fixes #605
- Fixes #610

Already fixed:

- #524 (this one can be closed, as it was addressed previously by upgrading vllm)

@viktor-ferenczi can you please confirm whether this branch fixes either #610 or #605

# Smoke tests

### basic parallel

passed

<details>

```python
import json
import vllm
from pydantic import BaseModel
from typing import List
import torch
import pandas as pd
from outlines.serve.vllm import JSONLogitsProcessor


class ConceptsList(BaseModel):
    concepts: List[str]


BASE_MODEL = "microsoft/phi-2"
llm = vllm.LLM(model=BASE_MODEL, tensor_parallel_size=1, dtype=torch.float16, max_model_len=2048)

logits_processor = JSONLogitsProcessor(ConceptsList, llm.llm_engine)

full_prompts = [
    f"Provide me a list of {i} strings with key 'concepts'"
    for i in range(20)
]

batch_results = llm.generate(
    full_prompts,
    sampling_params=vllm.SamplingParams(
        max_tokens=2048, logits_processors=[logits_processor]
    ),
)

for result in batch_results:
    for output in result.outputs:
        json.loads(output.text)
```

</details>

### never ending regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?)*",
        "n": 7
}'

{"text":["Sequence of numbers and letters:1-a-1-b-1-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-d-1-","Sequence of numbers and letters:2-a-1-b-2-c-1-b-","Sequence of numbers and letters:2-b-3-c-d-2-b-3-","Sequence of numbers and letters:2-a-3-b-2-b-1-c-","Sequence of numbers and letters:2-a-3-b-d-2-a-3-"]}

# rules for the above to validate correct FSM-sequence correspondence:
# [123] always followed by [abc], [def] only ever preceded by [abc]

# 1-a-1-b-1-c-1-a-
# 1-a-2-b-3-c-1-a-
# 1-a-2-b-3-c-d-1-
# 2-a-1-b-2-c-1-b-
# 2-b-3-c-d-2-b-3-
# 2-a-3-b-2-b-1-c-
# 2-a-3-b-d-2-a-3-
```

</details>

### sometimes ending early regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?){3}",
        "n": 16
}'
```

output

```
{"text":["Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:3-a-1-b-2-c-d-","Sequence of numbers and letters:2-a-1-b-3-c-d-","Sequence of numbers and letters:1-a-1-b-1-c-d-","Sequence of numbers and letters:2-a-3-b-d-1-c-e-","Sequence of numbers and letters:1-b-3-a-2-c-d-","Sequence of numbers and letters:3-a-d-1-b-e-2-c-","Sequence of numbers and letters:1-a-3-b-1-b-d-","Sequence of numbers and letters:3-a-f-2-b-d-1-c-","Sequence of numbers and letters:1-b-d-3-a-e-2-c-","Sequence of numbers and letters:3-c-1-b-d-1-a-e-","Sequence of numbers and letters:1-c-1-c-e-1-b-e-"]}
```

analysis:

```
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
3-a-1-b-2-c-d-
2-a-1-b-3-c-d-
1-a-1-b-1-c-d-
2-a-3-b-d-1-c-e-
1-b-3-a-2-c-d-
3-a-d-1-b-e-2-c-
1-a-3-b-1-b-d-
3-a-f-2-b-d-1-c-
1-b-d-3-a-e-2-c-
3-c-1-b-d-1-a-e-
1-c-1-c-e-1-b-e-
```

Observations:

- All patterns are correct
- Patterns don't "borrow" FSM state from one another, they retain their own independent state
- Some patterns produced more tokens than others successfully

</details>

### Viktor's regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n",
        "n": 1,
        "best_of": 1,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "repetition_penalty": 1.0,
        "temperature": 0.0,
        "top_p": 1.0,
        "top_k": -1,
        "min_p": 0.0,
        "use_beam_search": false,
        "length_penalty": 1.0,
        "early_stopping": false,
        "stop": [],
        "stop_token_ids": [],
        "include_stop_str_in_output": false,
        "ignore_eos": false,
        "max_tokens": 50,
        "logprobs": null,
        "prompt_logprobs": null,
        "skip_special_tokens": true,
        "spaces_between_special_tokens": true,
        "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}'
```

output:

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n2, 3, 5, 7, 11, 13, 17, 19, 23, 29\n"]}
```

</details>

### Viktor's schema

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n",
        "n": 5,
        "best_of": 5,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "repetition_penalty": 1.0,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": -1,
        "min_p": 0.0,
        "use_beam_search": false,
        "length_penalty": 1.0,
        "early_stopping": false,
        "stop": [],
        "stop_token_ids": [],
        "include_stop_str_in_output": false,
        "ignore_eos": false,
        "max_tokens": 200,
        "logprobs": null,
        "prompt_logprobs": null,
        "skip_special_tokens": true,
        "spaces_between_special_tokens": true,
        "schema": {
            "properties": {
                "kind": { "title": "Kind", "type": "string" },
                "color": { "title": "Color", "type": "string" },
                "count": { "title": "Count", "type": "integer" },
                "weight": { "title": "Weight", "type": "number" },
                "sweet": { "title": "Sweet", "type": "boolean" }
            },
            "required": [ "kind", "color", "count", "weight", "sweet" ],
            "title": "Fruit",
            "type": "object"
        }
}'
```

output (five completions, each echoing the prompt followed by a schema-conforming fruit JSON):

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.2,\n\"sweet\": true\n}", ...]}
```

The remaining four completions differed only in the generated fruit JSON:

```
{\n \"kind\": \"Apple\",\n \"color\": \"Red\",\n \"count\": 10,\n \"weight\": 0.2,\n \"sweet\": true\n}
{\n \"kind\": \"apple\",\n \"color\": \"red\",\n \"count\": 5,\n \"weight\": 0.1,\n \"sweet\": true\n}
{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.24,\n\"sweet\": true\n}
{\n \"kind\": \"Apple\",\n \"color\": \"red\",\n \"count\": 5,\n \"weight\": 0.3,\n \"sweet\": true\n}
```

</details>

---------

Co-authored-by: Andrew Lapp <[email protected]>
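The correspondence rules stated in the never-ending-regex smoke test ([123] always followed by [abc], [def] only ever preceded by [abc]) can be checked mechanically; a small helper (an editor's sketch, not part of the PR):

```python
def valid_correspondence(sample: str) -> bool:
    """Check adjacency rules for strings like '1-a-2-b-3-c-d-'."""
    parts = sample.strip("-").split("-")
    for prev, cur in zip(parts, parts[1:]):
        if prev in "123" and cur not in "abc":
            return False  # a digit must be followed by a/b/c
        if cur in "def" and prev not in "abc":
            return False  # d/e/f may only follow a/b/c
    return True
```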
Fixes #524

TODO:

- `seq_id` argument

Separate PR:

- remove `_patched_apply_logits_processors` entirely, it doesn't do anything anymore.

Related discussions: