Use token_ids to track the FSM state for each sequence in the vLLM integration #539

Closed
wants to merge 16 commits

Conversation

lapp0
Collaborator

@lapp0 lapp0 commented Jan 15, 2024

Fixes #524

TODO:

Separate PR:

Related discussions:

@rlouf
Member

rlouf commented Jan 16, 2024

The title contains "draft" but this is not a draft PR. Does that mean it is ready for review?

@rlouf rlouf changed the title Draft: fix beam search and multiple concurrent sequences using token_id tuple cache key User token_ids to track the FSM state for each sequence in the vLLM integration Jan 16, 2024
@rlouf rlouf changed the title User token_ids to track the FSM state for each sequence in the vLLM integration Use token_ids to track the FSM state for each sequence in the vLLM integration Jan 16, 2024
@rlouf rlouf added enhancement vLLM Things involving vLLM support labels Jan 16, 2024
@rlouf rlouf marked this pull request as draft January 16, 2024 19:44
@viktor-ferenczi

I cannot fork your fork, therefore I am pasting the working fix for RegexLogitsProcessor here. It fixes the FSMState cache logic and is tested to work for JSON schema and regex constraints.

import math
from typing import Dict, List, Tuple

import torch

# NOTE: import paths below are assumed and may differ between outlines versions
from outlines.fsm.fsm import FSMState, RegexFSM


class RegexLogitsProcessor:
    def __init__(self, regex_string, llm):
        """Compile the FSM that drives the regex-guided generation.

        Parameters
        ----------
        regex_string
            A string that represents a regular expression
        llm
            An instance of `vllm.LLM`

        """
        tokenizer = self.adapt_tokenizer(llm.tokenizer)

        fsm = RegexFSM(regex_string, tokenizer)
        self.fsm = fsm
        self.fsm_state_cache: Dict[int, FSMState] = {}

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        """Use the FSM to bias the logits before sampling the next token."""
        state = self.get_fsm_state(tuple(input_ids))
        allowed_tokens = self.fsm.allowed_token_ids(state)

        mask = torch.full((scores.shape[-1],), -math.inf, device=scores.device)
        mask[allowed_tokens] = 0
        biased_scores = scores + mask

        return biased_scores

    def get_fsm_state(self, input_ids: Tuple[int, ...]) -> FSMState:
        if not input_ids:
            return FSMState(0)

        state_key = hash(input_ids)
        state = self.fsm_state_cache.get(state_key)
        if state is not None:
            return state

        prev_input_ids = input_ids[:-1]
        prev_state = self.get_fsm_state(prev_input_ids)

        last_token = input_ids[-1]
        state = self.fsm.next_state(prev_state, last_token)

        self.fsm_state_cache[state_key] = state
        return state

    def adapt_tokenizer(self, tokenizer):
        """Adapt vLLM's tokenizer to use to compile the FSM.

        The API of Outlines tokenizers is slightly different to that of
        `transformers`. In addition we need to handle the missing spaces to
        Llama's tokenizer to be able to compile FSMs for this model.

        """
        tokenizer.vocabulary = tokenizer.get_vocab()
        tokenizer.special_tokens = set(tokenizer.all_special_tokens)

        def convert_token_to_string(token: str) -> str:
            from transformers.file_utils import SPIECE_UNDERLINE

            string = tokenizer.convert_tokens_to_string([token])

            # A hack to handle missing spaces in HF's Llama tokenizers
            if token.startswith(SPIECE_UNDERLINE) or token == "<0x20>":
                return " " + string

            return string

        tokenizer.convert_token_to_string = convert_token_to_string

        return tokenizer
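
For reference, a minimal usage sketch of the processor above with vLLM (not part of the PR): the model name is illustrative, and depending on the vLLM version the processor may need to be given `llm.llm_engine` rather than the `LLM` object itself.

import vllm

llm = vllm.LLM(model="microsoft/phi-2")

# Constrain generation to a comma-separated list of integers.
# Some vLLM versions expose the tokenizer on the engine, so pass
# llm.llm_engine instead of llm if the attribute lookup fails.
logits_processor = RegexLogitsProcessor(r"\d+(\s*,\s*\d+)*\s*", llm)

outputs = llm.generate(
    ["Write down the first 10 prime numbers as a comma separated list: "],
    sampling_params=vllm.SamplingParams(
        max_tokens=50, logits_processors=[logits_processor]
    ),
)
print(outputs[0].outputs[0].text)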

@viktor-ferenczi

I keep the working code in my dev branch here, so you can cherry-pick the fix from there as well: https://github.com/viktor-ferenczi/outlines/tree/dev

@viktor-ferenczi

@lapp0 Could you please consider the above fix? It works for me. I don't want to hijack your PR, but we need to get this out of draft and reviewed. Thanks!

@lapp0
Collaborator Author

lapp0 commented Jan 18, 2024

@lapp0 Could you please consider the above fix? It works for me. I don't want to hijack your PR, but we need to get this out of draft and reviewed. Thanks!

Today I will test it and, if successful, incorporate your changes.

@viktor-ferenczi

viktor-ferenczi commented Jan 18, 2024

We also need to set an upper limit on the recursion depth here:

prev_state = self.get_fsm_state(prev_input_ids)

The default recursion limit on Python 3.10 is 1000 and I don't see it increased in vLLM.

If the maximum possible recursion depth in the above code is in the hundreds, then it may be better to rewrite the code to avoid recursion. It is easy to do so by using a loop instead; it just ends up a bit less readable.

I've been running this with huge regex constraints and haven't seen any problems yet. But I haven't used a JSON schema constraint with it yet, aside from a single test case covering this constraint mode in my project.
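
For illustration, a minimal sketch of that loop-based variant: a drop-in replacement for the recursive get_fsm_state above, assuming the same fsm and fsm_state_cache attributes and the FSMState type from the earlier snippet.

    def get_fsm_state(self, input_ids: Tuple[int, ...]) -> FSMState:
        """Loop-based sketch: no recursion, so the interpreter's recursion limit is irrelevant."""
        # Walk back to the longest prefix of input_ids that is already cached.
        n = len(input_ids)
        while n > 0 and hash(input_ids[:n]) not in self.fsm_state_cache:
            n -= 1

        # Fall back to the FSM's initial state when no prefix is cached.
        state = self.fsm_state_cache.get(hash(input_ids[:n]), FSMState(0))

        # Replay the remaining tokens one by one, caching each intermediate state.
        for i in range(n, len(input_ids)):
            state = self.fsm.next_state(state, input_ids[i])
            self.fsm_state_cache[hash(input_ids[: i + 1])] = state

        return state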

@lapp0
Collaborator Author

lapp0 commented Jan 18, 2024

I cannot remove _patched_apply_logits_processor for now. CI will fail until vllm-project/vllm#2468 is merged. Will leave that to a follow-up PR.

@lapp0
Collaborator Author

lapp0 commented Jan 18, 2024

We also need to set an upper limit on the recursion depth here:

prev_state = self.get_fsm_state(prev_input_ids)

The default recursion limit on Python 3.10 is 1000 and I don't see it increased in vLLM.

If the maximum possible recursion depth in the above code is in the hundreds, then it may be better to rewrite the code to avoid recursion. It is easy to do so by using a loop instead; it just ends up a bit less readable.

I've been running this with huge regex constraints and haven't seen any problems yet. But I haven't used a JSON schema constraint with it yet, aside from a single test case covering this constraint mode in my project.

Previous implementation assumed the logits processor always saw prev_input_ids.

I think we should KeyError if prev_input_ids isn't cached. We should assume we have seen the predecessor, because otherwise we are re-parsing the entire generation for each token generated.
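
As a minimal sketch of that non-recursive lookup (assuming the attribute names from the snippet above; the actual PR code may differ), a missing predecessor surfaces as a KeyError instead of triggering a full re-parse:

    def get_fsm_state(self, input_ids: Tuple[int, ...]) -> FSMState:
        """Sketch: assume the predecessor sequence has already been processed."""
        if not input_ids:
            state = FSMState(0)
        else:
            # Raises KeyError if the predecessor was never seen, rather than
            # silently re-parsing the entire generation from scratch.
            prev_state = self.fsm_state_cache[hash(input_ids[:-1])]
            state = self.fsm.next_state(prev_state, input_ids[-1])
        self.fsm_state_cache[hash(input_ids)] = state
        return state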

@lapp0 lapp0 marked this pull request as ready for review January 18, 2024 13:44
@viktor-ferenczi

Previous implementation assumed the logits processor always saw prev_input_ids.

That's exactly why it crashed when I used your code here. That assumption does not seem to hold all the time for some reason, but I don't know why.

I think we should KeyError if prev_input_ids isn't cached. We should assume we have seen the predecessor, because otherwise we are re-parsing the entire generation for each token generated.

I will test your latest version here and see whether it works. I guess it will KeyError, but we'll see...

@lapp0
Collaborator Author

lapp0 commented Jan 18, 2024

Previous implementation assumed the logits processor always saw prev_input_ids.

That's exactly why it crashed when I used your code here. That assumption does not seem to hold all the time for some reason, but I don't know why.

I think we should KeyError if prev_input_ids isn't cached. We should assume we have seen the predecessor, because otherwise we are re-parsing the entire generation for each token generated.

I will test your latest version here and see whether it works. I guess it will KeyError, but we'll see...

If it fails, could you give an example of an API call which results in failure so I can debug?

@viktor-ferenczi

viktor-ferenczi commented Jan 18, 2024

Yeah, exactly. Got a KeyError as expected:

...
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 172, in _apply_logits_processors
    logits_row = logits_processor(token_ids, logits_row)
  File "/home/viktor/dep/outlines-contrib/outlines/serve/vllm.py", line 61, in __call__
    state = self.get_fsm_state(input_ids)
  File "/home/viktor/dep/outlines-contrib/outlines/serve/vllm.py", line 78, in get_fsm_state
    prev_state = self.fsm_state_cache[prev_state_key]
KeyError: 5740354900026072187
...

This is exactly why I added that recursion: to fill in any previous tokens missing from the cache.

Request was:

{
  "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n",
  "n": 1,
  "best_of": 1,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.0,
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "min_p": 0.0,
  "use_beam_search": false,
  "length_penalty": 1.0,
  "early_stopping": false,
  "stop": [],
  "stop_token_ids": [],
  "include_stop_str_in_output": false,
  "ignore_eos": false,
  "max_tokens": 50,
  "logprobs": null,
  "prompt_logprobs": null,
  "skip_special_tokens": true,
  "spaces_between_special_tokens": true,
  "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}

Model: TheBloke/deepseek-coder-33B-instruct-AWQ

GPUs: 2x4090 (2x24GB)

vLLM command:

python -O -u -m outlines.serve.serve \
  --model=TheBloke/deepseek-coder-33B-instruct-AWQ \
  --quantization=awq \
  --dtype=float16 \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --gpu-memory-utilization=0.95 \
  --enforce-eager \
  --disable-log-requests

@lapp0
Collaborator Author

lapp0 commented Jan 18, 2024

@viktor-ferenczi thank you, will investigate.

@viktor-ferenczi

viktor-ferenczi commented Jan 18, 2024

@lapp0 Caught the KeyError and printed the relevant variable values:

(RayWorkerVllm pid=205777) state_key = 4477961998282403984
(RayWorkerVllm pid=205777) input_ids = [17]
(RayWorkerVllm pid=205777) self.fsm_state_cache = {}

So it fails on the very first token, when there have been no previous tokens.

The KeyError is raised at this code line:

prev_state = self.fsm_state_cache[prev_state_key]

The prev_state_key here is hash(()), i.e. the hash of an empty tuple, which is constant: 5740354900026072187

This item is certainly not present in self.fsm_state_cache, because the cache is empty.

It means that the body of this if condition is never executed:

        if not input_ids:
            self.fsm_state_cache[state_key] = FSMState(0)

I think my fix was correct, unless it can be explained why the above happens.

@lapp0
Collaborator Author

lapp0 commented Jan 18, 2024

@lapp0 Caught the KeyError and printed the relevant variable values:

(RayWorkerVllm pid=205777) state_key = 4477961998282403984
(RayWorkerVllm pid=205777) input_ids = [17]
(RayWorkerVllm pid=205777) self.fsm_state_cache = {}

So it fails on the very first token, when there have been no previous tokens.

The KeyError is raised at this code line:

prev_state = self.fsm_state_cache[prev_state_key]

The prev_state_key here is hash(()), i.e. the hash of an empty tuple, which is constant: 5740354900026072187

This item is certainly not present in self.fsm_state_cache, because the cache is empty.

It means that the body of this if condition is never executed:

        if not input_ids:
            self.fsm_state_cache[state_key] = FSMState(0)

I think my fix was correct, unless it can be explained why the above happens.

Makes sense. The only variation from the previous implementation was that a defaultdict is no longer used. Your recursive solution alleviated this, but my removal of the recursion resulted in the error.

For the reason mentioned above, I still don't think we should recurse.

Before I push a new test case and specific handling for empty previous token IDs, could you confirm that you never received the log line

 input_ids = []

on a separate ray worker?

I want to eliminate the possibility that ray workers duplicated the logits processors, resulting in two separate state caches. I'm not even sure how you got a vLLM world size of 2 working, as I ran into issues with this until I made the logits processor a separate ray.actor.
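
As a small self-contained illustration of the defaultdict point above: a plain dict surfaces a missing prefix as a KeyError, while a defaultdict silently hands back an initial state.

from collections import defaultdict

# With a plain dict, looking up a predecessor that was never cached fails loudly.
plain_cache = {}
try:
    plain_cache[hash(())]
except KeyError:
    print("missing prefix surfaces as a KeyError")

# With a defaultdict, the same lookup silently creates and returns the initial
# state (0 here), hiding the fact that the prefix was never processed.
default_cache = defaultdict(int)
print(default_cache[hash(())])  # prints 0, no error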

@viktor-ferenczi

viktor-ferenczi commented Jan 18, 2024

Fortunately I kept all the logs from my tests today and can confirm that they don't contain any lines with input_ids = [] in them.

I had crashes before with defaultdict as well, just different ones. Anyway, I don't like the recursion either, so any solution which works without that would be perfect.

What I see is that, for some reason, the cache is not initialized here for the empty token_ids case, but it did happen in your tests. Why the difference?

Are we still sticking to vLLM 0.2.6?

I'm using that version, because it was mentioned on the doc page that outlines.serve.serve requires that one.

@rlouf
Member

rlouf commented Jan 18, 2024

You should be able to use vLLM 0.2.7. I opened a PR to update the docs: #547

@lapp0 lapp0 mentioned this pull request Jan 19, 2024
@lapp0
Collaborator Author

lapp0 commented Jan 19, 2024

@viktor-ferenczi I wasn't able to reproduce with --tensor-parallel-size=1; I got reasonable results for your prime list query.

For --tensor-parallel-size=2 I got the KeyError you described.

Some observations:

  • input_ids = [] was processed
  • The same ray worker was used for processing the 0th and 1st (KeyError) sequence, which is confusing, since I expected cache failures to result from different ray workers being used.

But it seems this is an issue related to tensor parallelism. Using the recursive solution appears to guarantee that the entire sequence is reprocessed for each token.

Should be resolved in #524

@lapp0 lapp0 marked this pull request as draft January 21, 2024 19:27
@lapp0 lapp0 marked this pull request as ready for review February 4, 2024 04:53


@pytest.mark.parametrize("forget_logits_processor", [True, False])
def test_time_regexp(forget_logits_processor):
Member


What is this testing exactly?

Collaborator Author


In vLLM 0.2.6, logits processors would be applied on a "random" ray worker. This would result in a regression because a new logits processor had to be created for each worker, causing the loss of previous state.

This test simulates that scenario by creating a new logits processor for every token processed.

However, I will have to experiment to determine whether this test and the changes to make it pass are still necessary, as vLLM now applies logits processors on the same worker each time as of vLLM 0.2.7, per #539 (comment).

assert re.fullmatch(pattern, llm.tokenizer.decode(token_ids)) is not None


def test_time_regexp_multiple_samples():
Member


What is this testing?

Collaborator Author

@lapp0 lapp0 Feb 5, 2024


I observed a lack of stability in sequence order when using beam search with Outlines. This resulted in a new token for one sequence being applied to a different sequence.

This test reproduces that behavior. It fails on main and passes with these changes.

I will leave an explanatory doc string.

pyproject.toml Outdated
@@ -60,7 +60,7 @@ test = [
"huggingface_hub"
]
serve = [
Member

@rlouf rlouf Feb 5, 2024


Does it work with vllm==0.3.0?

@lapp0
Collaborator Author

lapp0 commented Feb 5, 2024

It appears that with vllm==0.2.7, gathering tensors to a single worker fixed the tensor-parallel issue entirely AND fixed the unstable sequence ordering issue that was breaking beam search / parallel generation.

@viktor-ferenczi I'm going to close this PR. Thanks so much for your help on it, but it seems vllm resolved the issue upstream. Let me know if you think there's any remaining work (aside from your fix in #606).

Smoke Tests

Tests run on 2x A100 SXM4 with the following installed:

  • vllm==0.3.0
  • torch==2.1.2

Smoke test summary:

  • Does tensor parallel already work on outlines==0.0.26?
    • Yes
  • Does beam search already work on outlines==0.0.26?
    • Yes
  • Does multinomial sampling already work on outlines==0.0.26?
    • Yes

Smoke test tensor parallel (outlines==0.0.26)

Invoke server:

python3 -m outlines.serve.serve --tensor-parallel-size=2 --model="mistralai/Mistral-7B-Instruct-v0.2"

Call:

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is Pi? Give me the first 15 digits: ",
        "regex": "(-)?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-][0-9]+)?"
        }'

Result:

{"text":["What is Pi? Give me the first 15 digits: 3.14159265358979"]}

Smoke test beam search (outlines==0.0.26)

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Give me a sequence of EITHER letters or numbers, sequential, in order, starting with A or 1: ",
        "use_beam_search": 1,
        "n": 4,
        "temperature": 0,
        "max_tokens": 128
}'

Result:

{
  "text": [
    "Give me a sequence of EITHER letters or numbers, sequential, in order, starting with A or 1: 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S,",
    "Give me a sequence of EITHER letters or numbers, sequential, in order, starting with A or 1: 1, 2, 3, 4, 5, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y,",
    "Give me a sequence of EITHER letters or numbers, sequential, in order, starting with A or 1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q,",
    "Give me a sequence of EITHER letters or numbers, sequential, in order, starting with A or 1: 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S\n"
  ]
}

Note: different beams transition from numbers to letters at different points, but they retain internal consistency in their ordering.

@lapp0 lapp0 closed this Feb 5, 2024
@rlouf
Member

rlouf commented Feb 5, 2024

I don't see any issue linked to this PR, but maybe it closes a few?

@lapp0
Collaborator Author

lapp0 commented Feb 5, 2024

@rlouf

#579 resolves #524

rlouf pushed a commit that referenced this pull request Feb 14, 2024
For `outlines/vllm`, FSM-sequence correspondence was previously broken,
resulting in FSM state being mixed between sequences, corrupting output. To
alleviate this, we have `_patched_apply_logits_processor`, which passes a
stable sequence ID to the logits processor.

In this PR we eliminate `_patched_apply_logits_processor` and cache FSM
state based on the sequence's input token IDs.

Continuation of #539, but
much simpler because the vLLM upgrade fixed a lot of the issues being
addressed there.

Related discussions:
- #624

Fixes:
- Fixes #605 
- Fixes #610

Already fixed:
- #524 (this one can be
closed, as it was addressed previously by upgrading vllm)


@viktor-ferenczi can you please confirm whether this branch fixes either
#610 or
#605

# Smoke tests

### basic parallel

passed

<details>

```
import json
import vllm
from pydantic import BaseModel
from typing import List
import torch
import pandas as pd
from outlines.serve.vllm import JSONLogitsProcessor

class ConceptsList(BaseModel):
    concepts: List[str]

BASE_MODEL = "microsoft/phi-2"
llm = vllm.LLM(model=BASE_MODEL, tensor_parallel_size=1, dtype=torch.float16, max_model_len=2048)

logits_processor = JSONLogitsProcessor(ConceptsList, llm.llm_engine)

full_prompts = [
    f"Provide me a list of {i} strings with key 'concepts'"
    for i in range(20)
]

batch_results = llm.generate(
    full_prompts,
    sampling_params=vllm.SamplingParams(
        max_tokens=2048, logits_processors=[logits_processor]
    ),
)


for result in batch_results:
    for output in result.outputs:
            json.loads(output.text)
```

</details>


### never ending regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?)*",
        "n": 7
}'
{"text":["Sequence of numbers and letters:1-a-1-b-1-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-d-1-","Sequence of numbers and letters:2-a-1-b-2-c-1-b-","Sequence of numbers and letters:2-b-3-c-d-2-b-3-","Sequence of numbers and letters:2-a-3-b-2-b-1-c-","Sequence of numbers and letters:2-a-3-b-d-2-a-3-"]}


# rules for the above to validate correct FSM-sequence correspondence:
# [123] always followed by [abc], [def] only ever preceded by [abc]

# 1-a-1-b-1-c-1-a-
# 1-a-2-b-3-c-1-a-
# 1-a-2-b-3-c-d-1-
# 2-a-1-b-2-c-1-b-
# 2-b-3-c-d-2-b-3-
# 2-a-3-b-2-b-1-c-
# 2-a-3-b-d-2-a-3-
```

</details>


### sometimes ending early regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?){3}",
        "n": 16
}'
```

output

```
{"text":["Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:3-a-1-b-2-c-d-","Sequence of numbers and letters:2-a-1-b-3-c-d-","Sequence of numbers and letters:1-a-1-b-1-c-d-","Sequence of numbers and letters:2-a-3-b-d-1-c-e-","Sequence of numbers and letters:1-b-3-a-2-c-d-","Sequence of numbers and letters:3-a-d-1-b-e-2-c-","Sequence of numbers and letters:1-a-3-b-1-b-d-","Sequence of numbers and letters:3-a-f-2-b-d-1-c-","Sequence of numbers and letters:1-b-d-3-a-e-2-c-","Sequence of numbers and letters:3-c-1-b-d-1-a-e-","Sequence of numbers and letters:1-c-1-c-e-1-b-e-"]}
```

analysis:

```
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
3-a-1-b-2-c-d-
2-a-1-b-3-c-d-
1-a-1-b-1-c-d-
2-a-3-b-d-1-c-e-
1-b-3-a-2-c-d-
3-a-d-1-b-e-2-c-
1-a-3-b-1-b-d-
3-a-f-2-b-d-1-c-
1-b-d-3-a-e-2-c-
3-c-1-b-d-1-a-e-
1-c-1-c-e-1-b-e-
```

Observations:
- All patterns are correct
- Patterns don't "borrow" FSM state from one another; they retain their
own independent state
- Some patterns produced more tokens than others successfully


</details>

### Viktor's regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
  "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n",
  "n": 1,
  "best_of": 1,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.0,
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "min_p": 0.0,
  "use_beam_search": false,
  "length_penalty": 1.0,
  "early_stopping": false,
  "stop": [],
  "stop_token_ids": [],
  "include_stop_str_in_output": false,
  "ignore_eos": false,
  "max_tokens": 50,
  "logprobs": null,
  "prompt_logprobs": null,
  "skip_special_tokens": true,
  "spaces_between_special_tokens": true,
  "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}'
```

output:

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n2, 3, 5, 7, 11, 13, 17, 19, 23, 29\n"]}
```

</details>

### Viktor's schema

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
  "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n",
  "n": 5,
  "best_of": 5,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": -1,
  "min_p": 0.0,
  "use_beam_search": false,
  "length_penalty": 1.0,
  "early_stopping": false,
  "stop": [],
  "stop_token_ids": [],
  "include_stop_str_in_output": false,
  "ignore_eos": false,
  "max_tokens": 200,
  "logprobs": null,
  "prompt_logprobs": null,
  "skip_special_tokens": true,
  "spaces_between_special_tokens": true,
  "schema": {
    "properties": {
      "kind": {
        "title": "Kind",
        "type": "string"
      },
      "color": {
        "title": "Color",
        "type": "string"
      },
      "count": {
        "title": "Count",
        "type": "integer"
      },
      "weight": {
        "title": "Weight",
        "type": "number"
      },
      "sweet": {
        "title": "Sweet",
        "type": "boolean"
      }
    },
    "required": [
      "kind",
      "color",
      "count",
      "weight",
      "sweet"
    ],
    "title": "Fruit",
    "type": "object"
  }
}'
```

output:

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.2,\n\"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n    \"kind\": \"Apple\",\n    \"color\": \"Red\",\n    \"count\": 10,\n    \"weight\": 0.2,\n    \"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n  \"kind\": \"apple\",\n  \"color\": \"red\",\n  \"count\": 5,\n  \"weight\": 0.1,\n  \"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. 
If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.24,\n\"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n  \"kind\": \"Apple\",\n  \"color\": \"red\",\n  \"count\": 5,\n  \"weight\": 0.3,\n  \"sweet\": true\n}"]}
```

</details>

---------

Co-authored-by: Andrew Lapp <[email protected]>
Labels
enhancement vLLM Things involving vLLM support
Successfully merging this pull request may close these issues.

VLLM tensor-parallel and RegexLogitsProcessor