Add CFG-guided generation to the vLLM integration #541
Conversation
Would this be compatible with distributed inference using Ray? I'm trying to run Mistral 7B across 4 GPUs using vLLM's TensorParallel=4, and I see the following issue: TypeError: RegexLogitsProcessor.__call__() missing 1 required positional argument: 'scores'
(RayWorkerVllm pid=24725) Could not apply nest_asyncio: Can't patch loop of type <class 'uvloop.Loop'> [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
This is addressed by #539
This PR addresses generation of multiple sequences concurrently, which can take place on a single GPU without tensor parallelism. However, I'll get around to it since I need guided generation on multiple GPUs as well :) Tensor parallel / Ray is discussed in #524
Thank you for opening a PR! A couple of questions:
Hi, I think the serve.py file doesn't support batching. Batching is exercised in the example/vllm_integration.py file and it works.
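For reference, the batched call exercised by that example looks roughly like the sketch below. The model name is just an example, and the `CFGLogitsProcessor(grammar, llm)` constructor is assumed to follow the same pattern as the existing regex/JSON processors; the actual example file may differ.

```python
from vllm import LLM, SamplingParams
from outlines.serve.vllm import CFGLogitsProcessor  # assumed name for the CFG processor this PR adds

arithmetic_grammar = """
start: DECIMAL
DIGIT: "0".."9"
INT: DIGIT+
DECIMAL: INT "." INT? | "." INT
"""

llm = LLM("mistralai/Mistral-7B-v0.1")  # example model
logits_processor = CFGLogitsProcessor(arithmetic_grammar, llm)

# A batch of prompts handled in a single generate() call; each sequence keeps its own FSM state.
prompts = ["What is Pi? Give me the first 15 digits: ", "What is e? Give me the first 15 digits: "]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=32, logits_processors=[logits_processor]),
)
for output in outputs:
    print(output.outputs[0].text)
```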
Force-pushed from d4c4012 to ed74659
I ran this command twice and it crashed the server:
curl http://127.0.0.1:8000/generate \
-d '{
"prompt": "What is Pi? Give me the first 15 digits: ",
"grammar": "start: DECIMAL \r\nDIGIT: \"0\"..\"9\"\r\nINT: DIGIT+\r\nDECIMAL: INT \".\" INT? | \".\" INT"
}'
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Full traceback
INFO: 127.0.0.1:41880 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 363, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 342, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 190, in step_async
all_outputs = await self._run_workers_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 231, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 189, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 461, in execute_model
output = self.model.sample(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mistral.py", line 291, in sample
next_tokens = self.sampler(self.lm_head.weight, hidden_states,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 59, in forward
logits = _apply_logits_processors(logits, sampling_metadata)
File "/root/outlines/outlines/serve/vllm.py", line 33, in _patched_apply_logits_processors
logits_row = logits_processor(seq_id, token_ids, logits_row)
File "/root/outlines/outlines/serve/vllm.py", line 93, in __call__
self.fsm_state[seq_id] = self.fsm.next_state(
File "/root/outlines/outlines/fsm/fsm.py", line 316, in next_state
self.generation += self.tokenizer.decode([token_id])[0]
TypeError: can only concatenate str (not "list") to str
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 762, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 782, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 299, in app
raise e
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/root/outlines/outlines/serve/serve.py", line 100, in generate
async for request_output in results_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 449, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 443, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 70, in __anext__
raise result
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
raise exc
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Environment:
python3 -c "from outlines import _version; print(_version.version)"
0.0.25.dev14+ge16d986.d20240125
python3 -c "import sys; print('Python', sys.version)"
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
pip3 freeze
follows
Thanks for pointing out the error. The reason behind it was that the tokenizer is modified the first time you make a request; on the second request the changes are applied on top of the already-modified tokenizer, which makes it incompatible with the outlines model behavior and API. I added a flag to the vLLM decoder to check whether it has already been adapted for the outlines model and, if so, to skip adapting the tokenizer again. This change might not seem like a good choice, but this way we can keep the adaptation code in the vllm.py file and avoid leaking it into the serve file.
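A minimal sketch of that guard; the flag name and the `decode` wrapper below are illustrative assumptions, not the exact code from this PR:

```python
def _adapt_tokenizer(tokenizer):
    """Tailor vLLM's tokenizer to the interface outlines expects.

    Without the guard, a second request re-wraps `decode`, so
    `decode([token_id])[0]` returns a list instead of a string and the FSM
    fails with `TypeError: can only concatenate str (not "list") to str`.
    """
    if getattr(tokenizer, "_outlines_adapted", False):  # hypothetical flag name
        return tokenizer

    original_decode = tokenizer.decode

    def patched_decode(token_ids, **kwargs):
        # outlines expects decode() to return a list of strings.
        return [original_decode(token_ids, **kwargs)]

    tokenizer.decode = patched_decode
    tokenizer._outlines_adapted = True
    return tokenizer
```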
Thanks for the explanation. Could you write a test reproducing this problem? The test should be guarded by |
done
Will review some time today or tomorrow. Preliminary - it seems you designed your test case in a way which doesn't actually require GPU, so you can remove the |
I agree it should be executed in all cases.
Ran a variety of smoke tests, they were successful!
One problem I saw is that the model always ends on `.`, but I don't see anything in your change-set that would cause this. Seems like a separate issue. Are you able to reproduce?
Otherwise, the change-set looks good to me!
I cannot run Mistral on my machine, and with other models I sometimes see the JSON end without the dot (.). But a problem I do see with the grammar you mentioned in your comment in the linked issue is that the connection is not closed after the JSON is finished: I can see the result and the JSON is well-formed, but the curl connection is not closed. I don't think this problem is related to the changes either.
If other models don't end before the dot, it must be the model's fault that it ends when there's a period, not Outlines nor the Outlines integration with vLLM.
Force-pushed from 504c015 to 9bad41b
Hi,
I don't think anything is left.
…On Fri, Feb 9, 2024 at 8:45 AM Andrew Lapp wrote:
Hi @mory91 <https://github.com/mory91>, thanks for implementing this!
Could you let me know what work remains and if there's any way I could help?
@pytest.mark.parametrize("logit_processor, fsm_str", LOGIT_PROCESSORS)
def test_logit_processor(logit_processor, fsm_str: str):
What is this test doing?
string = tokenizer.convert_tokens_to_string([token])
"""
adapted_tokenizer = _adapt_tokenizer(llm.tokenizer)
I think this should be llm.tokenizer.tokenizer
return tokenizer
"""
adapted_tokenizer = _adapt_tokenizer(llm.tokenizer)
same as above, llm.tokenizer.tokenizer
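Concretely, the suggestion in these two comments would make the call read roughly as follows (assuming, as the comments do, that vLLM wraps the Hugging Face tokenizer one level deep):

```python
# llm.tokenizer is vLLM's tokenizer wrapper; the underlying Hugging Face
# tokenizer that outlines needs to adapt is assumed to live one level deeper.
adapted_tokenizer = _adapt_tokenizer(llm.tokenizer.tokenizer)
```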
I added CFG in vLLM in this PR: vllm-project/vllm#3211
We can close this PR as soon as the PR on vLLM is merged. We will also need to remove the vLLM-related code in the repo.
Thank you for contributing! vLLM is currently implementing this on their end, so I will close this PR for now.
Second attempt to add CFG support to vllm.
Context:
Currently vllm supports Regex and JsonSchema for vllm serving. This PR tries to add CFG support to vllm serving.
vLLM has the `vllm.LLM` class to handle offline inference and `vllm._AsyncLLMEngine` to handle async multi-node serving. `vllm._AsyncLLMEngine` is an instance of `vllm.LLMEngine`, and `vllm.LLM` has an underlying `engine` property which is an instance of `vllm.LLMEngine`. This is important because outlines needs to get the tokenizer from vLLM and tailor it to its needs, and `vllm.LLM` and `vllm.LLMEngine` have two distinct ways of getting the tokenizer. Outlines has some special requirements for the tokenizer API that differ slightly from vLLM's tokenizer; because of that it has an `adapt_tokenizer` function to tailor the tokenizer to its needs.
Current bug (thanks to #536 and #535):
The current code expects a `vllm._AsyncLLMEngine` (as demonstrated in `outlines/serve/serve.py`), although the documentation mentions `vllm.LLM` as the argument type, so the documentation is currently wrong. Because of this expectation, the code in `examples/vllm_integration.py` fails.
Proposed Solution:
Since vLLM states that `vllm.LLMEngine` is the main class of the vLLM engine, Outlines should expect a `vllm.LLMEngine` in its logits processors and get the tokenizer from that. Another solution could be supporting both and branching with a `hasattr` check.
Currently the `vllm.LLM` class is not supported as input to the logits processor, and that is the cause of the error in the example file.
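A minimal sketch of the `hasattr`-branching idea; this is illustrative only, and attribute names such as `llm_engine` and the nested `tokenizer.tokenizer` are assumptions based on the vLLM versions discussed in this thread rather than a guaranteed API:

```python
def get_hf_tokenizer(llm):
    """Return the underlying Hugging Face tokenizer from either a vllm.LLM
    (offline inference) or a vllm.LLMEngine / AsyncLLMEngine instance."""
    # vllm.LLM wraps an LLMEngine; the attribute is assumed to be `llm_engine`.
    engine = llm.llm_engine if hasattr(llm, "llm_engine") else llm

    tokenizer = engine.tokenizer
    # Some vLLM releases wrap the Hugging Face tokenizer in a tokenizer group,
    # in which case the real tokenizer sits one level deeper (see the review
    # comments above about `llm.tokenizer.tokenizer`).
    if hasattr(tokenizer, "tokenizer"):
        tokenizer = tokenizer.tokenizer
    return tokenizer
```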