[Server] Support openai prefix cache #2515
Conversation
@esmeetu I also created a PR (https://github.com/vllm-project/vllm/pull/2516/files) …
@Avinash-Raj I see, but there seem to be a few overlapping changes between us. I'm also considering whether to add auto-computing …
Hi, @Avinash-Raj. I introduced a new param …
I think it's a good point, but I'm still torn about adding all these non-standard options.
Yeah, what you said is a better design, but maybe not related to this feature. Could you create a PR for your refactor?
Yes, probably after the merge of #2488
@esmeetu did you encounter an assertion error when using the prefix caching feature?
@Avinash-Raj No, I didn't test on v0.3.0, but it was OK when this PR was submitted.
Why was this PR closed?
We implemented automatic prefix caching in #2762 and this API is no longer needed. |
This PR adds support for the `prefix_pos` and `prefix_stop` parameters in the API server. `prefix_pos` is the position of the last character of the prefix (prefix string length - 1); `prefix_stop` is the string that marks the end of the prefix in the prompt.

If we have the prompt `Below is an instruction that describes a task. Write a response that appropriately completes the request.\nHello`, we can use the prefix caching feature for the prefix string `Below is an instruction that describes a task. Write a response that appropriately completes the request.`. So we can set `prefix_stop` to something special like `<|prefix|>`, and the real prompt becomes `Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello`.
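For illustration, here is a minimal sketch of how a server could derive `prefix_pos` from `prefix_stop`; the function name `split_prefix` and its return convention are assumptions, not the actual vLLM implementation:

```python
def split_prefix(prompt: str, prefix_stop: str) -> tuple[str, int]:
    """Strip the prefix_stop marker from the prompt and return the
    cleaned prompt plus prefix_pos (index of the last prefix character)."""
    marker_idx = prompt.find(prefix_stop)
    if marker_idx == -1:
        return prompt, -1  # no marker: no cacheable prefix
    cleaned = prompt[:marker_idx] + prompt[marker_idx + len(prefix_stop):]
    # Everything before the marker is the prefix, so its last character
    # sits at marker_idx - 1 (also -1 when the marker is at position 0).
    return cleaned, marker_idx - 1

prompt = ("Below is an instruction that describes a task. Write a response "
          "that appropriately completes the request.\n<|prefix|>Hello")
cleaned, prefix_pos = split_prefix(prompt, "<|prefix|>")
```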
Here is an example of how to use it: bootstrap the OpenAI-compatible server, then send it the chat completion request below.
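A request along these lines might look as follows; the model name and endpoint are illustrative assumptions, and the payload mirrors the prompt from the description above:

```python
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # assumed model name
    "messages": [
        {
            "role": "user",
            "content": (
                "Below is an instruction that describes a task. Write a "
                "response that appropriately completes the request.\n"
                "<|prefix|>Hello"
            ),
        }
    ],
    "prefix_stop": "<|prefix|>",  # extra parameter introduced by this PR
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json())
```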
Furthermore, if a model's system prompt already contains a special string like `<|endoftext|>` indicating the end of the system prompt, this prefix caching feature can be used smoothly by passing that string as `prefix_stop`.