Support for Constrained decoding #288
Hi! We very much welcome you to contribute to this feature! I believe you can add this functionality by modifying the following places:
Curious if there's been any progress on this. I've hooked up Microsoft/Guidance and vLLM, but the most powerful features aren't available yet because of missing features in vLLM. Thank you!
Related to #535
Related topics:
I'm going to implement this.
I'd like to help implement this as well.
The constraint may change during generation. For example, in the case of #1191 it depends on what the JSON schema allows for the next token, which in turn depends on where the generation currently is within the schema. We cannot use the same constraint over the whole sequence in the general case. It must also work for beam search. How can we handle that efficiently via a REST API?
I think in the case of the REST API we could allow passing a formal description of the constraint in some generic, de facto standard format (if we can talk about one this soon), such as guidance. That would allow "compiling" the constraint inside the server and applying it to all generation for that sequence, including beam search. In the case of direct vLLM calls (from Python) we could let the user pass a callback that processes the logits before the token is chosen, so the probability of any unwanted token can be squashed to zero. It would be efficient and allow any algorithm to be used. Then we could provide adapters for the above-mentioned libraries.
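To illustrate the callback idea, here is a minimal sketch. The `JsonSchemaLogitsProcessor` class, the `schema_fsm` object, and the `logits_processors` hook on `SamplingParams` are all assumptions for illustration, not an existing vLLM API:

```python
from typing import List, Set

import torch


class JsonSchemaLogitsProcessor:
    """Hypothetical callback that masks out tokens the schema does not allow.

    The constraint state advances with every generated token, so each
    sequence (including every beam in beam search) needs its own instance.
    """

    def __init__(self, schema_fsm):
        # schema_fsm is a stand-in for whatever object tracks where the
        # generation currently is inside the JSON schema (e.g. a compiled FSM).
        self.fsm = schema_fsm

    def __call__(self, generated_token_ids: List[int],
                 logits: torch.Tensor) -> torch.Tensor:
        # Ask the constraint which token ids are legal in the current state.
        allowed: Set[int] = self.fsm.allowed_token_ids(generated_token_ids)
        mask = torch.full_like(logits, float("-inf"))
        mask[list(allowed)] = 0.0
        # Disallowed tokens get -inf added, squashing their probability to zero.
        return logits + mask


# Hypothetical usage, assuming vLLM exposed such a hook on SamplingParams:
# sampling_params = SamplingParams(logits_processors=[JsonSchemaLogitsProcessor(fsm)])
```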
Supporting the outlines library seems to be the best approach, because:
By contrast, jsonformer is limited to JSON only, and guidance does not have a clear way to integrate (it has spaghetti code).
This might be inefficient when generating structured data, for example a format like JSON, where a significant portion of the output consists of predetermined fields and symbols. Manipulating logits after a token is generated would be wasteful because we already know what the majority of tokens will be before generation. A feature of guidance is that it avoids running generation for tokens that are already known. Given that speed and efficiency are important to vLLM, how would we go about implementing something like this when integrating outlines or another framework?
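To make the "skip what you already know" idea concrete, here is a hypothetical decode loop; `model`, `fsm`, and their methods are placeholders rather than existing vLLM, outlines, or guidance APIs. Whenever the constraint leaves exactly one legal next token, it is appended without a forward pass:

```python
import torch


def constrained_decode(model, fsm, prompt_ids, max_tokens):
    """Hypothetical decode loop that skips forward passes for forced tokens.

    model.forward and fsm.allowed_token_ids / fsm.is_finished are placeholder
    interfaces used only to illustrate the idea.
    """
    token_ids = list(prompt_ids)
    for _ in range(max_tokens):
        allowed = fsm.allowed_token_ids(token_ids)  # legal next tokens
        if len(allowed) == 1:
            # The next token is fully determined by the format (e.g. '{' or '":'),
            # so append it without spending a forward pass on it.
            token_ids.append(next(iter(allowed)))
        else:
            logits = model.forward(token_ids)  # one forward pass over the vocab
            mask = torch.full_like(logits, float("-inf"))
            mask[list(allowed)] = 0.0
            token_ids.append(int(torch.argmax(logits + mask)))  # greedy for simplicity
        if fsm.is_finished(token_ids):
            break
    return token_ids
```

Presumably the forced tokens still have to be fed through the model at some point (e.g. as a batched prefill) to keep the KV cache consistent, which is part of what makes this non-trivial to integrate into an engine like vLLM.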
Let's separate the two features:
Since these features are largely independent, I suggest implementing them in the above order.
Minimal prototype: #1243
This could be implemented by finishing the LMQL integration.
As I understand it, guidance uses the … I haven't tested it yet, but I think this is the way.
+1 to support
How would you go about building this, given that the sampler only runs after the forward pass?
LM Format Enforcer is a library that achieves JSON Schema decoding and supports vLLM. (Disclosure: I am the author of the library.)
Outlines author here. The PR dottxt-ai/outlines#366 will allow easy integration into vLLM. Estimated time of completion is next week. See dottxt-ai/outlines#163 (comment) for a diagram that summarizes the new architecture. We can work together on the integration and on finding the boundary that makes the most sense for both libraries.
@rlouf did you manage to make much progress yet?
Yes: https://outlines-dev.github.io/outlines/reference/vllm/ More is coming (soon)!
We need a similar solution integrated into vLLM by default. I would suggest just porting over GBNF, since RegEx cannot be fully supported (and is also too complex) and JSON Schema is too restrictive for simple use cases.
Outlines' reference implementation of the vLLM server (https://github.com/outlines-dev/outlines/blob/main/outlines/serve/serve.py) is a copy of vLLM's https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py with a few patches and add-ons. I believe this code should live in vLLM rather than in outlines, and there should be an analogous implementation of the OpenAI endpoint. @viktor-ferenczi, do you think this is a promising path? I'd be willing to invest time into this.
I think that is something to be decided by the maintainers of the outlines and vLLM projects. Currently both projects are changing rapidly and have quite a few bugs, so maybe this is something to decide later as they stabilize. I'm just a small contributor / user, not a decision maker here.
@viktor-ferenczi, fair enough. @zhuohan123 and @rlouf, what is your assessment?
I think it would make sense: vLLM benefits from structured generation, and Outlines can re-focus on its main goals.
Added support for guided decoding in `api_server` by integrating _outlines_ (https://github.com/outlines-dev/outlines).
It would be nice to have constrained decoding out of the box, because as things stand I have to fix bugs to get it working with outlines after every single vLLM update, only to see those fixes deleted by yet another round of changes.
I just read about SGLang's approach to constrained decoding. Did you consider adding that to vLLM instead of Outlines? See, for example, this blog article: https://lmsys.org/blog/2024-02-05-compressed-fsm/
SGLang's code was copied from Outlines'; they just decided to import Outlines instead and implemented the change. See also this blog post, published prior to theirs, which explains the limits of a character-based approach.
We now support the full range of constrained/guided decoding powered by Outlines; closing this as completed.
@simon-mo is there a different process for contributing documentation, or should one just open a PR? I may have some time in three weeks…
PRs welcome! I added sparse documentation on this at https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters but more examples are appreciated!
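To complement that documentation, here is a minimal request example against the OpenAI-compatible server. The `guided_json` extra parameter follows the linked docs, but the exact parameter names and the model name below are assumptions; verify them against the vLLM version you are running.

```python
from openai import OpenAI

# Points at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# JSON Schema the output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumption: whatever model the server was started with
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    # vLLM-specific extra parameters are passed through extra_body.
    extra_body={"guided_json": schema},
)
print(completion.choices[0].message.content)
```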
For getting structured outputs from custom fine-tuned LLMs, extensive use of constrained decoding is standard.
Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near future?
How would one go about implementing this in vLLM (if I were to add a PR)?