[Frontend] Don't log duplicate error stacktrace for every request in the batch #9023
Conversation
@@ -164,7 +164,7 @@ async def test_failed_abort(tmp_socket):
                sampling_params=SamplingParams(max_tokens=10),
                request_id=uuid.uuid4()):
            pass
-        assert "KeyError" in repr(execinfo.value)
+        assert "MQEngineDeadError" in repr(execinfo.value)
I think this change might break the logic for this test?

IIUC, what was being done previously was that a KeyError was being raised in the engine, which then caused an MQEngineDeadError. The test then checks that the original KeyError is still referenced in the raised MQEngineDeadError.

I think it's still important for the original error to be surfaced once; is this change here intentional as part of not repeating the stack trace a bunch of times?
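For reference, here is a condensed sketch of the flow being described, pieced together from the diff above (not the verbatim test code; import paths are assumed):

import uuid

import pytest

from vllm import SamplingParams
from vllm.engine.multiprocessing import MQEngineDeadError  # import path assumed

RAISED_ERROR = KeyError  # the error injected into the engine to simulate a failure

async def check_original_error_is_surfaced(client):
    # The engine dies with RAISED_ERROR; the client then raises MQEngineDeadError.
    with pytest.raises(MQEngineDeadError) as execinfo:
        async for _ in client.generate(prompt="Hello my name is",
                                       sampling_params=SamplingParams(max_tokens=10),
                                       request_id=uuid.uuid4()):
            pass
    # The original error should still be referenced in the propagated one.
    assert "KeyError" in repr(execinfo.value)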
Oh wait... that is not what I meant either. My intention was to put an MQEngineBatchError. I did a quick local fix, and it worked as I expected as well.
I think it's still important for the original error to be surfaced once, is this change here intentional as part of not repeating the stack trace a bunch of times?
Totally agree. But thinking of the system as a whole, this error is logged at least once in this snippet:
# repo/vllm/engine/multiprocessing/engine.py
def start(self):
    try:
        try:
            logger.debug("Starting Startup Loop.")
            self.run_startup_loop()
            logger.debug("Starting heartbeat thread")
            self.heartbeat_thread.start()
            logger.debug("Starting Engine Loop.")
            self.run_engine_loop()
        except Exception as e:
            logger.exception(repr(e))
is this change here intentional as part of not repeating the stack trace a bunch of times?
The challenge here is on the client side: previously, when there was an error (in the batch), all requests received the same exception, which was then propagated to the server layer where it got logged many times. Therefore, your request might receive an exception reporting a KeyError that is not related to it at all, but was raised by another request. That's why, for now, I think it might make sense to change this test here.
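To illustrate the behavior described above, here is a generic sketch (not the actual vLLM client code) of how a single batch-level failure fans out to every in-flight request:

import asyncio

# Generic sketch only: when the engine reports a batch-level failure, every
# in-flight request receives the very same exception, regardless of which
# request actually caused it.
def propagate_batch_error(pending: dict[str, asyncio.Queue],
                          exception: BaseException) -> None:
    for request_id, output_queue in pending.items():
        # Each waiting request gets the same exception object pushed to its queue.
        output_queue.put_nowait(exception)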
Ah, yeah I gotcha. We don't want to log the full stack trace here for the original RAISED_ERROR since it was already logged when that error actually happened. However, looking at the example server logs you posted, it does look like we are at least keeping the string repr of the original exception on the propagated exception here:
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
(the FORCED EXCEPTION is there from your original exception)

I probably would not change the RAISED_ERROR here to be an MQEngineBatchError, because this is the error that we are making the underlying LLMEngine raise, and it will never do that. I think instead here we should keep a different error (KeyError is probably still fine) and check that the message from it makes it in here.
This should work, right?
RAISED_ERROR = KeyError("foo")
...
assert "foo" in repr(execinfo.value)
Just updated.
I went back to the original code:
RAISED_ERROR = KeyError
I did that before because there are some lines where we have:
with pytest.raises(RAISED_ERROR):
After carefully reading the tests again, I actually had to change only a few lines (in other places) to preserve the original tests. I think it makes more sense now, thank you for helping me realize that!
This should work, right?
RAISED_ERROR = KeyError("foo")
...
assert "foo" in repr(execinfo.value)
Almost, I printed this test and I got:
Engine loop is not running. Inspect the stacktrace to find the original error: KeyError().
In this test, this KeyError is raised from a request that does not exist:
# Trigger an abort on the client side.
# This request ID does not exist, and will cause the engine to error
await client.abort(request_id="foo")
Not sure why I had to change it before. I reverted it and the tests run fine.
"stacktrace to find the original error: " | ||
f"{repr(exception)}") | ||
# If it is a runtime exception, we assume that | ||
# the engine is already dead, let's pass this |
Ah I think maybe I see: we're trying to send MQEngineDeadError instead of MQEngineBatchError if the engine is already dead... 🤔

I'm not sure that checking for RuntimeError is the most robust; we should be able to instead check if self.errored.
Yeah... I broke a test before because of this RuntimeError. There is already an exception handler that expects a RuntimeError and will shut down the server next (in this case it already has the behavior that we expect in this PR: no duplicated error stacktrace).
# repo/vllm/entrypoints/launcher.py
@app.exception_handler(RuntimeError)
async def runtime_error_handler(request: Request, __):
    """On generic runtime error, check to see if the engine has died.
    It probably has, in which case the server will no longer be able to
    handle requests. Trigger a graceful shutdown with a SIGTERM."""
    engine = request.app.state.engine_client
    if (not envs.VLLM_KEEP_ALIVE_ON_ENGINE_DEATH and engine.errored
            and not engine.is_running):
        logger.fatal("AsyncLLMEngine has failed, terminating server "
                     "process")
        # See discussions here on shutting down a uvicorn server
        # https://github.com/encode/uvicorn/discussions/1103
        # In this case we cannot await the server shutdown here because
        # this handler must first return to close the connection for
        # this request.
        server.should_exit = True
So, in this PR one of my contributions was to add a flow that handles exceptions that are not RuntimeErrors, which previously caused the stacktrace to be logged multiple times.
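For illustration, a rough sketch of that kind of handler, with assumed names (this is not the exact code added in this PR):

import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)
app = FastAPI()

class MQEngineBatchError(RuntimeError):
    """Stand-in for the exception introduced in this PR (definition assumed)."""

@app.exception_handler(MQEngineBatchError)
async def batch_error_handler(request: Request, exc: MQEngineBatchError):
    # The full stack trace was already logged once by the engine process, so
    # log only a single line per failed request instead of letting
    # FastAPI/uvicorn dump the whole traceback again.
    logger.error("Request failed due to a batch error: %s", exc)
    return JSONResponse(status_code=500, content={"error": str(exc)})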
So, the intent behind the extra RuntimeError there was to try to catch anything that might have killed the engine but did not raise an EngineDeadError, which ideally should not happen. It's kinda equivalent to
try:
    handle_request(req)
except EngineDeadError:
    engine_dead_handler(req)
except RuntimeError:
    runtime_error_handler(req)
Maybe that should actually be changed to @app.exception_handler(Exception) to catch everything? Would that solve this problem entirely? I had originally written these handlers, and I think I may have just made a bad assumption that unexpected things would be RuntimeErrors, which is probably often true but not always true. (It's not true for your test case where you explicitly raised Exception 😄)
Then regardless of the server handling, I do think the error handling here in the engine should be more robust than just checking for RuntimeError. If we want special logic based on whether the engine is already dead, then I think we can do:
batch_error = MQEngineDeadError(msg) if self.errored else MQEngineBatchError(msg)
Maybe that should actually be changed to @app.exception_handler(Exception) to catch everything?
Sort of. I added a custom handler for MQEngineBatchError because we are sure that the engine had an error and it was logged before. But what if we get an unhandled exception not related to the engine? I think in that case we should let the exception go up to the server and let FastAPI/Uvicorn do their thing.
batch_error = MQEngineDeadError(msg) if self.errored else MQEngineBatchError(msg)
Today, in practice, any exception will eventually raise an MQEngineDeadError.
# Set to error state only on engine critical error
# (and record only the first one)
if is_engine_errored and not self._errored_with:
    self._errored_with = exception
Honestly, I don't know the best solution; I wrote mine based on my reading of the code and tried to keep the old behavior. I found two scenarios: i) runtime exception: shut down the server immediately; ii) other exceptions: set the engine errored, and when another request arrives it throws MQEngineDeadError and shuts down the server next. A simplification would be to treat everything as runtime errors...
I introduced MQEngineBatchError thinking that in the future vLLM would be more robust to those errors (raise an error and keep running), but maybe it is too soon for that feature. I would like to hear more opinions on that.
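A tiny illustrative sketch of scenario ii) described above, with names assumed from the snippets in this thread (not the actual client code):

from typing import Optional

class MQEngineDeadError(RuntimeError):
    """Stand-in for vLLM's MQEngineDeadError; the real definition is assumed."""

class ClientSketch:
    """Scenario ii): the first failure only records the error; a later request
    is the one that actually surfaces MQEngineDeadError."""

    def __init__(self) -> None:
        self._errored_with: Optional[BaseException] = None

    def record_engine_error(self, exception: BaseException) -> None:
        # Record only the first engine-critical error (mirrors the snippet above).
        if self._errored_with is None:
            self._errored_with = exception

    def check_health(self) -> None:
        # A subsequent request trips over the recorded error, and only then
        # does the server shut down.
        if self._errored_with is not None:
            raise MQEngineDeadError(
                "Engine loop is not running. Inspect the stacktrace to find "
                f"the original error: {self._errored_with!r}.")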
Thanks for all the clarification @wallashss! This is definitely a bit confusing to hold it all in my head 😅
But what if we get an unhandled exception not related to the engine? I think in that case we should let the exception go up to the server and let FastAPI/Uvicorn do their thing.
That's a good distinction, yeah. I'm okay with leaving stack traces in for things that we don't think were caused by an exception that killed the engine.
ii) other exceptions: set the engine errored, and when another request arrives it throws MQEngineDeadError and shuts down the server next. A simplification would be to treat everything as runtime errors...
Oh, so actually I don't think we want to wait for the second request to come in to throw MQEngineDeadError; we should probably throw it immediately so that the server shuts down ASAP once it can no longer process requests.
I introduced MQEngineBatchError thinking that in the future vLLM would be more robust to those errors (raise an error and keep running), but maybe it is too soon for that feature.
Yeah, it's a nice touch but I don't think we currently have the robustness to use it, like you say any exception will kill the engine :(
This is super close, I think I just have these two comments left then:
- I'd rather not include the new error type MQEngineBatchError since we don't yet have a case where an exception is raised but the engine stays alive. (My care amount on this is low)
- I want to make sure we're always killing the engine and server ASAP once it errors. (My care amount on this is high). I think this is already mostly covered by your code, we'll just want to make sure that we don't run into the case you mentioned about raising one error first and then waiting for the next request to raise an MQEngineDeadError. Maybe a catch-all for this could be like
@app.exception_handler(Exception)
async def handler(r, e):
    if engine.errored:
        # log just the exception message
        # shut down server
    else:
        # log full stack trace
Thanks for the review @joerunde! See my comments and check if they make sense. I am totally open to reviewing ideas and changing the implementation.
# exception for a batch, and we may not know the
# request that caused it, neither if it was actually
# caused by any of them (e.g. CUDA OOM). Therefore we
# broadcast the same exception for all requests.
Nice, love the explanation here!
assert client.errored

- # Engine is errored, should get ENGINE_DEAD_ERROR.
+ # Throws an error that should get ENGINE_DEAD_ERROR.
We could also check a "batch" of requests here, like
async def do_generate(client):
    async for _ in client.generate(prompt="Hello my name is",
                                   sampling_params=SamplingParams(),
                                   request_id=uuid.uuid4()):
        pass

...

# (in this test)
tasks = [asyncio.create_task(do_generate(client)) for _ in range(10)]
# Check that every `task` in `tasks` failed with `MQEngineDeadError`
# should get the same exception as a MQEngineDeadError.
errors = await asyncio.gather(*tasks, return_exceptions=True)
for e in errors:
    assert "KeyError" in repr(e)
@wallashss I think we need to assert that these errors are also MQEngineDeadErrors here, then we're good to go.
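Something along these lines should cover it (a sketch that continues the diff above; asyncio, tasks, and MQEngineDeadError are assumed to be available in the test):

errors = await asyncio.gather(*tasks, return_exceptions=True)
for e in errors:
    # Every request should fail with the dead-engine error type...
    assert isinstance(e, MQEngineDeadError)
    # ...and that error should still reference the original KeyError.
    assert "KeyError" in repr(e)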
Thanks @wallashss @joerunde!
Don't log duplicate error stacktrace for every request in the batch
EDIT: Discussed with @joerunde, and we changed the solution. On client.py, if the engine is errored, then the server should shut down; to do that, we have to make sure to send an MQEngineDeadError regardless of the type of the exception. Previously, the server only shut down if the exception was a RuntimeError, and for those cases it does not log a replicated stacktrace. With that, we removed the MQEngineBatchError exception, because it no longer makes sense, and we still achieved the goal of this PR, because the launcher already has an exception handler that does not replicate logs when it catches MQEngineDeadError.

Currently, if there is an error in the engine when processing the current batch, the whole stacktrace ends up getting logged for every request. This PR addresses this issue to improve server log readability.
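As a rough sketch of the revised approach (assumed shape and names; the actual diff may differ):

class MQEngineDeadError(RuntimeError):
    """Stand-in for vLLM's MQEngineDeadError; subclassing RuntimeError is assumed
    here so the launcher's existing RuntimeError handler would catch it."""

def wrap_engine_error(exception: BaseException,
                      is_engine_errored: bool) -> BaseException:
    # Sketch only: once the engine is errored, always report MQEngineDeadError so
    # the server shuts down without re-logging the full stack trace for every
    # in-flight request.
    if is_engine_errored:
        return MQEngineDeadError(
            "Engine loop is not running. Inspect the stacktrace "
            f"to find the original error: {exception!r}.")
    return exception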
Steps to reproduce
I created a script that keeps sending requests to the server in parallel, to keep the engine busy and batching multiple requests.
On the vLLM side I added a hardcoded check to force an exception (not sure if there is a better way to do that). Basically, if it receives a request with max_tokens==123, then it raises an exception. Then, just send a poisoned request with max_tokens==123 to make the server crash; a minimal example is sketched below.
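Purely as an illustration (endpoint, port, and model name are placeholders, not taken from this PR), the poisoned request could look like this:

import requests  # third-party HTTP client, used here only for the sketch

# Hypothetical request that trips the hardcoded max_tokens == 123 check.
resp = requests.post(
    "http://localhost:8000/v1/completions",  # placeholder URL/port
    json={
        "model": "<served-model>",  # placeholder model name
        "prompt": "Hello",
        "max_tokens": 123,
    },
)
print(resp.status_code, resp.text)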
The server log for this scenario follows:
Server log
Solution
I created a new custom exception, MQEngineBatchError, which is raised when the engine has an error while processing a batch. When this exception is propagated to the HTTP server, before uvicorn/fastapi tries to log it and pollute the log, a custom exception handler only prints a single line to report that there was an error in the batch for this request. There is still one print per request, because we have to give at least some feedback about why each request failed, but the full stacktrace is only logged once. With this change we can have an output log as follows:
New Server log