[Server] Support openai prefix cache #2515

Closed
esmeetu wants to merge 6 commits

Conversation

@esmeetu (Collaborator) commented Jan 20, 2024

Add support for the prefix_pos and prefix_stop parameters in the API server.

  1. prefix_pos is the position of the end of the prefix (prefix length - 1).
  2. prefix_stop is the string that marks the end of the prefix inside the prompt.
    For example, given the prompt
    Below is an instruction that describes a task. Write a response that appropriately completes the request.\nHello.
    we can use the prefix caching feature for the prefix string
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    So we set prefix_stop to a special marker such as <|prefix|>, and the actual prompt sent to the server becomes
    Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello.
    (A sketch of how the server could split the prompt at this marker is shown after the list.)
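A minimal sketch of the server-side idea, for illustration only (not the actual code in this PR; the helper name split_prefix and treating prefix_pos as a token index are assumptions):

```python
# Hypothetical sketch of deriving prefix_pos from prefix_stop.
# split_prefix is a made-up name; the real PR may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-name")

def split_prefix(prompt: str, prefix_stop: str):
    """Split the prompt at prefix_stop and compute prefix_pos for the cached part."""
    if prefix_stop not in prompt:
        return prompt, None  # no marker, so no prefix caching for this request
    prefix, rest = prompt.split(prefix_stop, maxsplit=1)
    real_prompt = prefix + rest  # the marker itself is removed from the prompt
    # prefix_pos points at the last token of the cached prefix.
    prefix_token_ids = tokenizer(prefix, add_special_tokens=False).input_ids
    prefix_pos = len(prefix_token_ids) - 1
    return real_prompt, prefix_pos
```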

Here is an example of how to use it: start the OpenAI-compatible server, then send it the chat completion JSON below (a Python client sketch follows the JSON):

{
    "model": "your-model-name",
    "messages": [
        {
            "role": "user",
            "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello"
        }
    ],
    "prefix_stop": "<|prefix|>"
}
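A minimal client-side sketch for sending this request (assuming the server is running locally on port 8000; prefix_stop is the non-standard field added by this PR and is simply included in the JSON body):

```python
# Sketch: POST the chat completion request with the extra prefix_stop field.
import requests

payload = {
    "model": "your-model-name",
    "messages": [
        {
            "role": "user",
            "content": (
                "Below is an instruction that describes a task. Write a response "
                "that appropriately completes the request.\n<|prefix|>Hello"
            ),
        }
    ],
    "prefix_stop": "<|prefix|>",
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```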

Furthermore, if a model's prompt template already contains a special string, such as <|endoftext|> marking the end of the system prompt, this prefix caching feature can be used without any prompt changes.

@Avinash-Raj (Contributor) commented Jan 20, 2024

@esmeetu I also created a PR (https://github.com/vllm-project/vllm/pull/2516/files) which auto-detects prefix_pos in case the passed prompt is of type str.

@esmeetu (Collaborator, Author) commented Jan 20, 2024

@Avinash-Raj I see, but there seem to be a few overlapping changes between our PRs. I also considered computing prefix_pos automatically by some other means, but that adds complexity to the server, which would have to split the prompt into two parts, the input and the prefix. So I simply added this parameter to support it. I need more ideas about prefix caching use cases. @simon-mo any ideas?

@esmeetu (Collaborator, Author) commented Jan 21, 2024

Hi @Avinash-Raj, I introduced a new param, prefix_stop. I think it is better than putting the whole prefix string into the request. What do you think?

@FlorianJoncour (Contributor) commented

I think it's a good point, but I still have mixed feelings about adding all these non-standard options.
Shouldn't all of that be placed in an object like extensions, or maybe vllm?

@esmeetu (Collaborator, Author) commented Jan 22, 2024

> I think it's a good point, but I still have mixed feelings about adding all these non-standard options.
>
> Shouldn't all of that be placed in an object like extensions, or maybe vllm?

Yeah, what you describe is a better design, but it's probably not related to this feature. Could you create a PR for that refactor?

@FlorianJoncour (Contributor) commented

Yes, probably after the merge of #2488

@Avinash-Raj (Contributor) commented

@esmeetu did you encounter an assertion error when using the prefix caching feature?

File "/python3.11/site-packages/vllm/worker/model_runner.py", line 783, in _pad_to_max
    assert len(x) <= max_len
           ^^^^^^^^^^^^^^^^^
AssertionError

@esmeetu (Collaborator, Author) commented Jan 31, 2024

@Avinash-Raj No, I haven't tested on v0.3.0, but it worked when this PR was submitted.

@esmeetu closed this Feb 29, 2024
@xyfZzz commented Mar 2, 2024

Why was this PR closed?

@zhuohan123 (Member) commented

> Why was this PR closed?

We implemented automatic prefix caching in #2762, so this API is no longer needed.
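For reference, a minimal sketch of how that replacement is used (assuming a vLLM version that includes #2762; prefix caching is enabled engine-wide rather than per request):

```python
# Sketch: with automatic prefix caching there is no per-request parameter;
# shared prompt prefixes are detected and reused by the engine itself.
from vllm import LLM, SamplingParams

llm = LLM(model="your-model-name", enable_prefix_caching=True)
prefix = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n")
outputs = llm.generate([prefix + "Hello"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```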

@esmeetu deleted the openai-prefix-cache branch March 23, 2024 11:11
@esmeetu mentioned this pull request Apr 11, 2024