
[Frontend] Don't log duplicate error stacktrace for every request in the batch #9023

Merged
merged 4 commits on Oct 21, 2024

Conversation

wallashss
Contributor

@wallashss wallashss commented Oct 2, 2024

Don't log duplicate error stacktrace for every request in the batch

EDIT: After discussing with @joerunde, we changed the solution. In client.py, if the engine is errored, the server should shut down; to guarantee that, we make sure an MQEngineDeadError is sent regardless of the type of the original exception. Previously, the server only shut down if the exception was a RuntimeError, and in that case it already avoided logging the replicated stacktrace. With that, we removed the MQEngineBatchError exception, since it no longer makes sense, while still achieving the goal of this PR: the launcher already has an exception handler that does not replicate logs when it catches MQEngineDeadError.
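
For clarity, a minimal sketch of that idea, assuming a small client-side helper (the helper name and wrapping logic are illustrative, not the actual client.py code):

# Sketch only: wrap whatever exception killed the engine into an
# MQEngineDeadError so the launcher shuts the server down without
# re-logging the full stacktrace for every request.
class MQEngineDeadError(RuntimeError):
    pass


def to_engine_dead_error(exc: BaseException) -> MQEngineDeadError:
    if isinstance(exc, MQEngineDeadError):
        return exc
    return MQEngineDeadError(f"Engine loop is dead. Original error: {exc!r}")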


Currently, if there is an error in the engine while processing the current batch, the whole stacktrace ends up getting logged for every request in that batch. This PR addresses the issue to improve server log readability.

Steps to reproduce

I created a script that keeps sending requests to the server in parallel, to keep the engine busy batching multiple requests.

import json
from multiprocessing import Pool

import requests

prompts = ["how to make cheesecake", "How to make pizza", "Who is afraid of the big bad wolf?", "Who is the president of Brazil", "What is a capital of Spain", "Who is the president of USA"]


def do_generate(idx):
    # Keep one prompt in flight per worker so the engine always has a batch to process.
    while True:
        data = {
            "model": "ibm/merlinite-7b",
            "prompt": [prompts[idx]],
            "max_tokens": 200,
            "temperature": 0,
            "stream": False
        }

        res = requests.post("http://localhost:8000/v1/completions",
                            data=json.dumps(data),
                            headers={"Content-Type": "application/json"})

        print(json.dumps(res.json(), indent=2))


if __name__ == "__main__":
    # One worker per prompt (prompts are indexed 0..5).
    indices = list(range(len(prompts)))
    with Pool(len(indices)) as p:
        p.map(do_generate, indices)

On the vLLM side I added a hardcoded check to force an exception (not sure if there is a better way to do that): if the engine receives a request with max_tokens == 123, it raises an exception.

    @torch.inference_mode()
    @dump_input_when_exception(exclude_args=[0], exclude_kwargs=["self"])
    def execute_model(
        self,
        model_input: ModelInputForGPUWithSamplingMetadata,
        kv_caches: List[torch.Tensor],
        intermediate_tensors: Optional[IntermediateTensors] = None,
        num_steps: int = 1,
    ) -> Optional[Union[List[SamplerOutput], IntermediateTensors]]:
        # Hardcoded poison check, used only to reproduce the issue.
        for g in model_input.sampling_metadata.seq_groups:
            if g.sampling_params.max_tokens == 123:
                raise Exception("FORCED EXCEPTION")

Then just send a poisoned curl with max_tokens == 123 to make the server crash.

curl http://localhost:8000/v1/completions -H "Content-Type: application/json"   -d '{
    "model": "ibm/merlinite-7b",
    "prompt": ["How to make pizza"],
    "max_tokens": 123,
    "temperature": 0
  }'

Below is the server log for this scenario:

Server log
INFO 10-01 18:43:01 api_server.py:520] vLLM API server version 0.6.1.post2
INFO 10-01 18:43:01 api_server.py:521] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/shared_model_storage/transformers_cache/models--ibm--merlinite-7b/snapshots/233d12759d5bb9344231dafdb51310ec19d79c0e/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
[...]
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-c52dbd9a95114d1cbfa182e588138c86-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-c52dbd9a95114d1cbfa182e588138c86-0.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-cf480cd01b3f45f9a44ee90f910b6dc7-0: prompt: 'Who is afraid of the big bad wolf?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 8526, 302, 272, 2032, 2607, 24100, 28804], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-a81e48bee0294b0b84cb89766e3747b1-0: prompt: 'Who is the president of Brazil', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 13250], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-91fe755b3d0248ab94b2671eae5be62b-0: prompt: 'What is a capital of Spain', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1824, 349, 264, 5565, 302, 12567], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-62a2bd68b2514c3997280fd68ae3bf38-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 metrics.py:351] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-cf480cd01b3f45f9a44ee90f910b6dc7-0.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-a81e48bee0294b0b84cb89766e3747b1-0.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-91fe755b3d0248ab94b2671eae5be62b-0.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-62a2bd68b2514c3997280fd68ae3bf38-0.
INFO:     ::1:51364 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:28 logger.py:36] Received request cmpl-c5044c83c77b430d9038ecd2b7d1089e-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:28 engine.py:255] Added request cmpl-c5044c83c77b430d9038ecd2b7d1089e-0.
INFO:     ::1:51372 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:28 logger.py:36] Received request cmpl-3af6a00b38db4987a36242d859e5cf93-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:28 engine.py:255] Added request cmpl-3af6a00b38db4987a36242d859e5cf93-0.
INFO:     ::1:51374 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:29 logger.py:36] Received request cmpl-c9c4172a7de346d7ac1d38b3c2ef8446-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:29 engine.py:255] Added request cmpl-c9c4172a7de346d7ac1d38b3c2ef8446-0.
INFO:     ::1:51380 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:29 logger.py:36] Received request cmpl-dbc219febf5f4375ae74c0eed37c30a0-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:29 engine.py:255] Added request cmpl-dbc219febf5f4375ae74c0eed37c30a0-0.
INFO:     ::1:51388 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:30 logger.py:36] Received request cmpl-97b0729b39ac464e98518e9773d42fff-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:30 engine.py:255] Added request cmpl-97b0729b39ac464e98518e9773d42fff-0.
INFO:     ::1:51402 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:30 logger.py:36] Received request cmpl-21dd2f2709064410bcd6c66eda41d1da-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:30 engine.py:255] Added request cmpl-21dd2f2709064410bcd6c66eda41d1da-0.
INFO:     ::1:51412 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:31 logger.py:36] Received request cmpl-6837e12c5c3c4ac5aaf6bbe816dd2a69-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:31 engine.py:255] Added request cmpl-6837e12c5c3c4ac5aaf6bbe816dd2a69-0.
INFO:     ::1:51426 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:31 logger.py:36] Received request cmpl-69faaae7ba6b48e58655c07a2e7ba046-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=123, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:31 engine.py:255] Added request cmpl-69faaae7ba6b48e58655c07a2e7ba046-0.
INFO 10-01 18:43:31 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241001-184331.pkl...
INFO 10-01 18:43:31 model_runner_base.py:141] Completed writing input of failed execution to /tmp/err_execute_model_input_20241001-184331.pkl.
INFO:     ::1:51348 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-01 18:43:31 engine.py:130] Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION')
ERROR 10-01 18:43:31 engine.py:130] Traceback (most recent call last):
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-01 18:43:31 engine.py:130]     return func(*args, **kwargs)
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner.py", line 1548, in execute_model
ERROR 10-01 18:43:31 engine.py:130]     raise Exception("FORCED EXCEPTION")
ERROR 10-01 18:43:31 engine.py:130] Exception: FORCED EXCEPTION
ERROR 10-01 18:43:31 engine.py:130] 
ERROR 10-01 18:43:31 engine.py:130] The above exception was the direct cause of the following exception:
ERROR 10-01 18:43:31 engine.py:130] 
ERROR 10-01 18:43:31 engine.py:130] Traceback (most recent call last):
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 128, in start
ERROR 10-01 18:43:31 engine.py:130]     self.run_engine_loop()
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 188, in run_engine_loop
ERROR 10-01 18:43:31 engine.py:130]     request_outputs = self.engine_step()
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 10-01 18:43:31 engine.py:130]     raise e
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 198, in engine_step
ERROR 10-01 18:43:31 engine.py:130]     return self.engine.step()
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/llm_engine.py", line 1228, in step
ERROR 10-01 18:43:31 engine.py:130]     outputs = self.model_executor.execute_model(
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-01 18:43:31 engine.py:130]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-01 18:43:31 engine.py:130]     output = self.model_runner.execute_model(
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/site-packages/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-01 18:43:31 engine.py:130]     return func(*args, **kwargs)
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 144, in _wrapper
ERROR 10-01 18:43:31 engine.py:130]     raise type(err)(
ERROR 10-01 18:43:31 engine.py:130] Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51352 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51356 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51354 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  [Previous line repeated 1 more time]
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51438 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  [Previous line repeated 2 more times]
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
CRITICAL 10-01 18:43:31 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     ::1:51454 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [62660]

Solution

I created a new custom exception, MQEngineBatchError, which is raised when the engine hits an error while processing a batch. When this exception propagates up to the HTTP server, before uvicorn/fastapi tries to log it and pollute the log, a custom exception handler prints only a single line to report that the batch this request belonged to failed. There is still one line per request, because we have to give at least some feedback on why each request failed, but the full stacktrace is only logged once, as we can see in this snippet:

#vllm/engine/multiprocessing/engine.py
def start(self):
    try:
        try:
            logger.debug("Starting Startup Loop.")
            self.run_startup_loop()
            logger.debug("Starting heartbeat thread")
            self.heartbeat_thread.start()
            logger.debug("Starting Engine Loop.")
            self.run_engine_loop()
        except Exception as e:
            logger.exception(repr(e)) # HERE: Log the exception for the first time
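
For reference, a minimal sketch of the kind of launcher-side handler described above, using FastAPI's exception_handler hook; the function names and response body here are illustrative assumptions, not the actual vLLM launcher code:

# Sketch only: log one line per failed request instead of the full stacktrace,
# which was already logged once by the engine process.
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logger = logging.getLogger("vllm.entrypoints.launcher")


class MQEngineBatchError(RuntimeError):
    """Raised for every request whose batch failed in the engine."""


def register_batch_error_handler(app: FastAPI) -> None:
    @app.exception_handler(MQEngineBatchError)
    async def handle_batch_error(request: Request, exc: MQEngineBatchError):
        logger.error(repr(exc))  # single line, no traceback
        return JSONResponse(
            status_code=500,
            content={"error": "engine failed while processing the batch"})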

With this change, the server log looks like this:

New Server log
INFO 10-02 13:00:19 api_server.py:520] vLLM API server version 0.6.1.post2
INFO 10-02 13:00:19 api_server.py:521] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/shared_model_storage/transformers_cache/models--ibm--merlinite-7b/snapshots/233d12759d5bb9344231dafdb51310ec19d79c0e/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
[...]
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-2cda52baff774a22ae96289bfe6633cf-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-2cda52baff774a22ae96289bfe6633cf-0.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-3ab8b1fe7d3946de8143ce0ae9ce9652-0: prompt: 'Who is afraid of the big bad wolf?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 8526, 302, 272, 2032, 2607, 24100, 28804], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-71db974fad6f4d9ebd9b449c17b3c033-0: prompt: 'What is a capital of Spain', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1824, 349, 264, 5565, 302, 12567], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-01665a5bcabd456e967097e5ad5572b2-0: prompt: 'Who is the president of Brazil', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 13250], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-8d0708b726264b05be87395de88f1fc6-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 metrics.py:351] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-3ab8b1fe7d3946de8143ce0ae9ce9652-0.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-71db974fad6f4d9ebd9b449c17b3c033-0.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-01665a5bcabd456e967097e5ad5572b2-0.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-8d0708b726264b05be87395de88f1fc6-0.
INFO:     ::1:36138 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:43 logger.py:36] Received request cmpl-f49daa3545254a1080057dee3495b058-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:43 engine.py:255] Added request cmpl-f49daa3545254a1080057dee3495b058-0.
INFO:     ::1:41382 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:44 logger.py:36] Received request cmpl-4263620a4dad440ba4b6c5657557390d-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:44 engine.py:255] Added request cmpl-4263620a4dad440ba4b6c5657557390d-0.
INFO:     ::1:41390 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:44 logger.py:36] Received request cmpl-0d82c4a335be43eaa03d739410192c22-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:44 engine.py:255] Added request cmpl-0d82c4a335be43eaa03d739410192c22-0.
INFO:     ::1:41392 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:45 logger.py:36] Received request cmpl-9740702ee25c4380861024255ea8b5f6-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=123, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:45 engine.py:255] Added request cmpl-9740702ee25c4380861024255ea8b5f6-0.
INFO 10-02 13:00:45 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241002-130045.pkl...
INFO 10-02 13:00:45 model_runner_base.py:141] Completed writing input of failed execution to /tmp/err_execute_model_input_20241002-130045.pkl.
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36130 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36132 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 engine.py:130] Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')
ERROR 10-02 13:00:45 engine.py:130] Traceback (most recent call last):
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-02 13:00:45 engine.py:130]     return func(*args, **kwargs)
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner.py", line 1549, in execute_model
ERROR 10-02 13:00:45 engine.py:130]     raise Exception("FORCED EXCEPTION")
ERROR 10-02 13:00:45 engine.py:130] Exception: FORCED EXCEPTION
ERROR 10-02 13:00:45 engine.py:130] 
ERROR 10-02 13:00:45 engine.py:130] The above exception was the direct cause of the following exception:
ERROR 10-02 13:00:45 engine.py:130] 
ERROR 10-02 13:00:45 engine.py:130] Traceback (most recent call last):
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 128, in start
ERROR 10-02 13:00:45 engine.py:130]     self.run_engine_loop()
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 188, in run_engine_loop
ERROR 10-02 13:00:45 engine.py:130]     request_outputs = self.engine_step()
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 10-02 13:00:45 engine.py:130]     raise e
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 198, in engine_step
ERROR 10-02 13:00:45 engine.py:130]     return self.engine.step()
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/llm_engine.py", line 1228, in step
ERROR 10-02 13:00:45 engine.py:130]     outputs = self.model_executor.execute_model(
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-02 13:00:45 engine.py:130]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-02 13:00:45 engine.py:130]     output = self.model_runner.execute_model(
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/env2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-02 13:00:45 engine.py:130]     return func(*args, **kwargs)
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 144, in _wrapper
ERROR 10-02 13:00:45 engine.py:130]     raise type(err)(
ERROR 10-02 13:00:45 engine.py:130] Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36152 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36136 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:41404 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 10-02 13:00:45 launcher.py:100] MQLLMEngine is already dead, terminating server process
INFO:     ::1:41408 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [64354]


github-actions bot commented Oct 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@wallashss wallashss changed the title from "[Frontend] Added MQEngineBatchError to improve stacktrace readability" to "[Frontend] Don't log duplicate error stacktrace for every request in the batch" on Oct 2, 2024
@wallashss wallashss force-pushed the dont_duplicate_err branch 3 times, most recently from da442ba to 0b18f78 on October 3, 2024 at 13:23
@@ -164,7 +164,7 @@ async def test_failed_abort(tmp_socket):
                 sampling_params=SamplingParams(max_tokens=10),
                 request_id=uuid.uuid4()):
             pass
-        assert "KeyError" in repr(execinfo.value)
+        assert "MQEngineDeadError" in repr(execinfo.value)
Collaborator

I think this change might break the logic for this test?

IIUC, what was being done previously was that a KeyError was being raised in the engine, which then caused an MQEngineDeadError. The test then checks that the original KeyError is still referenced in the raised MQEngineDeadError.

I think it's still important for the original error to be surfaced once, is this change here intentional as part of not repeating the stack trace a bunch of times?

Contributor Author

Oh wait... that is not what I meant either. My intention was to use an MQEngineBatchError. I did a quick local fix, and it worked as I expected.

I think it's still important for the original error to be surfaced once, is this change here intentional as part of not repeating the stack trace a bunch of times?

Totally agree. But thinking about the system as a whole, this error is already logged at least once in this snippet:

#repo/vllm/engine/multiprocessing/engine.py
    def start(self):
        try:
            try:
                logger.debug("Starting Startup Loop.")
                self.run_startup_loop()
                logger.debug("Starting heartbeat thread")
                self.heartbeat_thread.start()
                logger.debug("Starting Engine Loop.")
                self.run_engine_loop()
            except Exception as e:
                logger.exception(repr(e))

is this change here intentional as part of not repeating the stack trace a bunch of times?

The challenge here is on the client side: previously, when there was an error in the batch, all requests received the same exception, which was then propagated to the server layer where it got logged many times. Therefore, your request might receive an exception reporting a KeyError that has nothing to do with it, because it was actually raised by another request. That's why, for now, I think it makes sense to change this test.

Collaborator

Ah, yeah I gotcha. We don't want to log the full stack trace here for the original RAISED_ERROR since it was already logged when that error actually happened. However, looking at the example server logs you posted, it does look like we are at least keeping the string repr of the original exception on the propagated exception here:

ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")

(the FORCED EXCEPTION is there from your original exception)

I probably would not change the RAISED_ERROR here to be an MQEngineBatchError, because this is the error that we are making the underlying LLMEngine raise, and it will never do that. I think instead we should keep a different error here (KeyError is probably still fine) and check that its message makes it through.

This should work, right?

RAISED_ERROR = KeyError("foo")

...

assert "foo" in repr(execinfo.value)

Contributor Author

Just updated.

I went back to the original code:

RAISED_ERROR = KeyError

I had changed it before because there are some lines where we have:

with pytest.raises(RAISED_ERROR):

After carefully reading the tests again, I actually had to change only a few lines (in other places) to preserve the original tests. I think it makes more sense now; thank you for helping me realize that!

This should work, right?

RAISED_ERROR = KeyError("foo")

...

assert "foo" in repr(execinfo.value)

Almost. I printed the output of this test and got:

Engine loop is not running. Inspect the stacktrace to find the original error: KeyError().

In this test this KeyError is raised from a request that does not exist:

# Trigger an abort on the client side.
# This request ID does not exist, and will cause the engine to error
await client.abort(request_id="foo")

Not sure why I had to change it before. I reverted it and the tests run fine.

"stacktrace to find the original error: "
f"{repr(exception)}")
# If it is a runtime exception, we assume that
# the engine is already dead, let's pass this
Collaborator

Ah, I think maybe I see: we're trying to send MQEngineDeadError instead of MQEngineBatchError if the engine is already dead... 🤔

I'm not sure that checking for RuntimeError is the most robust approach; we should be able to check self.errored instead.
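
A minimal sketch of that suggestion (illustrative only; the helper name is hypothetical, it assumes the engine exposes an errored property, and it reuses the error types discussed in this thread rather than the exact vLLM code):

def _exception_for_batch(self, exception: BaseException):
    # Hypothetical helper, for illustration only.
    msg = ("A batch generation failed. Inspect the stacktrace to find "
           f"the original error: {repr(exception)}")
    if self.errored:
        # Engine is already dead: send an engine-dead error so the
        # server shuts down instead of serving doomed requests.
        return MQEngineDeadError(msg)
    # Engine survived the batch failure: report a batch-level error.
    return MQEngineBatchError(msg)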

Contributor Author

Yeah... I broke a test before because of this RuntimeError. There is already an exception handler that expects a RuntimeError and will shut down the server next (in this case it already keeps the behavior we expect in this PR: no duplicated error stacktrace).

# repo/vllm/entrypoints/launcher.py
@app.exception_handler(RuntimeError)
async def runtime_error_handler(request: Request, __):
    """On generic runtime error, check to see if the engine has died.
    It probably has, in which case the server will no longer be able to
    handle requests. Trigger a graceful shutdown with a SIGTERM."""
    engine = request.app.state.engine_client
    if (not envs.VLLM_KEEP_ALIVE_ON_ENGINE_DEATH and engine.errored
            and not engine.is_running):
        logger.fatal("AsyncLLMEngine has failed, terminating server "
                     "process")
        # See discussions here on shutting down a uvicorn server
        # https://github.com/encode/uvicorn/discussions/1103
        # In this case we cannot await the server shutdown here because
        # this handler must first return to close the connection for
        # this request.
        server.should_exit = True

So, one of my contributions in this PR was to add a flow that handles Exceptions that are not RuntimeErrors, which otherwise cause the stacktrace to be logged multiple times.

Collaborator

So, the intent behind the extra RuntimeError there was to try to catch anything that might have killed the engine, but did not raise an EngineDeadError, which ideally should not happen. It's kinda equivalent to

try:
    handle_request(req)
except EngineDeadError:
    engine_dead_handler(req)
except RuntimeError:
    runtime_error_handler(req)

Maybe that should actually be changed to @app.exception_handler(Exception) to catch everything? Would that solve this problem entirely by just doing that? I had originally written these handlers, and I think I may have just made a bad assumption that unexpected things would be RuntimeErrors, which is probably often true but not always true. (It's not true for your test case where you explicitly raised Exception 😄 )

Then regardless of the server handling, I do think the error handling here in the engine should be more robust than just checking for RuntimeError. If we want special logic based on whether the engine is already dead, then I think we can do:

batch_error = MQEngineDeadError(msg) if self.errored else MQEngineBatchError(msg)

Contributor Author

Maybe that should actually be changed to @app.exception_handler(Exception) to catch everything?

Sort of. I added a custom handler for MQEngineBatchError because in that case we are sure that the engine had an error and it was already logged. But what if we get an unhandled exception not related to the engine? I think in that case we should let the exception go up to the server and let FastAPI/Uvicorn do their thing.

batch_error = MQEngineDeadError(msg) if self.errored else MQEngineBatchError(msg)

Today, in practice, any exception will eventually raise an MQEngineDeadError.

    # Set to error state only on engine critical error
    # (and record only the first one)
    if is_engine_errored and not self._errored_with:
        self._errored_with = exception

Honestly, I don't know the best solution; I wrote mine based on my reading of the code and tried to keep the old behavior. I found two scenarios: i) a runtime exception shuts down the server immediately; ii) other exceptions set the engine to errored, and when another request arrives it throws an MQEngineDeadError and the server shuts down next. A simplification would be to treat everything as a runtime error...

I introduced MQEngineBatchError thinking that in the future vLLM would be more robust to those errors, i.e. raise an error and keep running, but maybe it is too soon for that feature. I would like to hear more opinions on that.
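
For reference, the dead-engine propagation described above looks roughly like this on the client side (a simplified sketch pieced together from the snippets and log messages in this thread; the _raise_if_dead helper name is hypothetical, not the exact client.py code):

    @property
    def errored(self) -> bool:
        # The client is considered errored once any engine failure
        # has been recorded in _errored_with.
        return self._errored_with is not None

    def _raise_if_dead(self):
        # Illustrates how later requests end up with an MQEngineDeadError
        # that wraps the first recorded exception.
        if self.errored:
            raise MQEngineDeadError(
                "Engine loop is not running. Inspect the stacktrace "
                f"to find the original error: {repr(self._errored_with)}.")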

Collaborator

@joerunde joerunde Oct 9, 2024

Thanks for all the clarification @wallashss! This is definitely a bit confusing to hold it all in my head 😅

But, what if we got an unhandled exception not related to the engine? I think in those case let the exception go up to the server and let FastAPI/Uvicorn do their thing.

That's a good distinction, yeah. I'm okay with leaving stack traces in for things that we don't think were caused by an exception that killed the engine.

ii) other exceptions, set the engine erroed, when another request arrive then throws MQEngineDeadError and shutdown the server next. A simplification would be treat everything as runtime errors...

Oh, so actually I don't think we want to wait for the second request to come in to throw MQEngineDeadError; we should probably throw it immediately so that the server shuts down ASAP once it can no longer process requests.

I introduced MQEngineBatchError thinking that in the future vLLM would be more robust to those errors, raise an error and keep up and running, but maybe it is a too soon feature

Yeah, it's a nice touch but I don't think we currently have the robustness to use it, like you say any exception will kill the engine :(

This is super close, I think I just have these two comments left then:

  1. I'd rather not include the new error type MQEngineBatchError since we don't yet have a case where an exception is raised but the engine stays alive. (My care amount on this is low)
  2. I want to make sure we're always killing the engine and server ASAP once it errors. (My care amount on this is high). I think this is already mostly covered by your code, we'll just want to make sure that we don't run into the case you mentioned about raising one error first and then waiting for the next request to raise an MQEngineDeadError. Maybe a catch-all for this could be like
@app.exception_handler(Exception)
async def handler(r, e):
    if engine.errored:
        # log just the exception message
        # shut down server
    else:
        # log full stack trace

But I think that's very ham-fisted and probably not necessary 😉
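
For concreteness, a fleshed-out version of that catch-all might look like the sketch below (modeled on the runtime_error_handler excerpt above; it assumes the same app, logger, and server objects as that excerpt plus FastAPI's JSONResponse, and the handler name and JSON body are illustrative, not the PR's actual code):

from fastapi import Request
from fastapi.responses import JSONResponse

@app.exception_handler(Exception)
async def unhandled_exception_handler(request: Request, exc: Exception):
    engine = request.app.state.engine_client
    if engine.errored and not engine.is_running:
        # The engine already logged the root cause when it died, so log
        # only the message here and trigger a graceful shutdown.
        logger.fatal("Engine has failed, terminating server process")
        server.should_exit = True
    else:
        # Failure unrelated to engine death: keep the full stack trace.
        logger.exception(repr(exc))
    return JSONResponse(status_code=500, content={"error": repr(exc)})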

@wallashss
Contributor Author

Thanks for the review @joerunde! See my comments and check whether they make sense. I am totally open to reviewing ideas and changing the implementation.

wallashss added a commit to wallashss/vllm that referenced this pull request Oct 8, 2024
wallashss added a commit to wallashss/vllm that referenced this pull request Oct 8, 2024
# exception for a batch, and we may not know the
# request that caused it, nor whether it was actually
# caused by any of them (e.g. CUDA OOM). Therefore we
# broadcast the same exception to all requests.
Collaborator

Nice, love the explanation here!

assert client.errored

# Engine is errored, should get ENGINE_DEAD_ERROR.
# Throws an error that should get ENGINE_DEAD_ERROR.
Collaborator

We could also check a "batch" of requests here, like

def do_generate(client):
    async for _ in client.generate(prompt="Hello my name is",
                                   sampling_params=SamplingParams(),
                                   request_id=uuid.uuid4()):
        pass

...
# (in this test)
tasks = [asyncio.create_task(do_generate(client)) for _ in range(10)]

# Check that every `task` in `tasks` failed with `MQEngineDeadError`

That should test that we don't get the big spew of stack traces, since every request will raise an error type that doesn't log the stack trace

# should get the same exception as a MQEngineDeadError.
errors = await asyncio.gather(*tasks, return_exceptions=True)
for e in errors:
    assert "KeyError" in repr(e)
Collaborator

@wallashss I think we need to assert that these errors are also MQEngineDeadErrors here, then we're good to go
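
Something like the following would cover that (a sketch; it assumes MQEngineDeadError is importable from vllm.engine.multiprocessing and reuses the tasks from the snippet above):

errors = await asyncio.gather(*tasks, return_exceptions=True)
for e in errors:
    # Every request in the batch should fail with the engine-dead error,
    # and that error should still reference the original KeyError.
    assert isinstance(e, MQEngineDeadError)
    assert "KeyError" in repr(e)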

Member

@njhill njhill left a comment

@njhill njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 17, 2024
@njhill njhill merged commit 711f3a7 into vllm-project:main Oct 21, 2024
59 checks passed
charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
MErkinSag pushed a commit to MErkinSag/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
…the batch (vllm-project#9023)

Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Labels
ready ONLY add when PR is ready to merge/full CI is needed
3 participants