[Server] Support openai prefix cache #2515
Conversation
@esmeetu I also created a PR (https://github.com/vllm-project/vllm/pull/2516/files) …
@Avinash-Raj I see, but there seem to be a few overlapping changes between us. I'm also considering whether to add auto-computing …
Hi, @Avinash-Raj. I introduced a new param …
I think it's a good point, but I'm still torn about adding all these non-standard options.
Yeah, what you said is a better design, but maybe not related to this feature. Could you create a PR for your refactor?
Yes, probably after the merge of #2488
@esmeetu did you encounter an assertion error when using the prefix caching feature?
@Avinash-Raj No, I didn't test on v0.3.0, but it was OK when this PR was submitted.
Why was this PR closed?
We implemented automatic prefix caching in #2762 and this API is no longer needed. |
This PR adds support for the `prefix_pos` and `prefix_stop` parameters in the API server. `prefix_pos` is the position of the last character of the prefix (prefix string length - 1); `prefix_stop` is the string that marks the end of the prefix in the prompt.

If we have the prompt `Below is an instruction that describes a task. Write a response that appropriately completes the request.\nHello`, we can use the prefix caching feature for the prefix string `Below is an instruction that describes a task. Write a response that appropriately completes the request.`. So we can set `prefix_stop` to something special like `<|prefix|>`, and the real prompt becomes `Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello`.
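For illustration, here is a minimal sketch of how a server could derive `prefix_pos` from `prefix_stop`; the function name `split_prefix` and its return convention are assumptions, not the actual vLLM implementation:

```python
def split_prefix(prompt: str, prefix_stop: str) -> tuple[str, int]:
    """Strip the prefix_stop marker from the prompt and return the
    cleaned prompt plus prefix_pos (index of the last prefix character)."""
    marker_idx = prompt.find(prefix_stop)
    if marker_idx == -1:
        return prompt, -1  # no marker: no cacheable prefix
    cleaned = prompt[:marker_idx] + prompt[marker_idx + len(prefix_stop):]
    # Everything before the marker is the prefix, so its last character
    # sits at marker_idx - 1 (also -1 when the marker is at position 0).
    return cleaned, marker_idx - 1

prompt = ("Below is an instruction that describes a task. Write a response "
          "that appropriately completes the request.\n<|prefix|>Hello")
cleaned, prefix_pos = split_prefix(prompt, "<|prefix|>")
```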
Here is an example of how to use it: bootstrap the OpenAI-compatible server, then send it the chat completion request below.
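A request along these lines might look as follows; the model name and endpoint are illustrative assumptions, and the payload mirrors the prompt from the description above:

```python
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # assumed model name
    "messages": [
        {
            "role": "user",
            "content": (
                "Below is an instruction that describes a task. Write a "
                "response that appropriately completes the request.\n"
                "<|prefix|>Hello"
            ),
        }
    ],
    "prefix_stop": "<|prefix|>",  # extra parameter introduced by this PR
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json())
```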
Furthermore, if a model's system prompt already contains a special string like `<|endoftext|>` indicating the end of the system prompt, this prefix caching feature can be used smoothly by passing that string as `prefix_stop`.