
[Frontend] Don't log duplicate error stacktrace for every request in the batch #9023

Merged
merged 4 commits on Oct 21, 2024

Conversation

wallashss
Contributor

@wallashss wallashss commented Oct 2, 2024

Don't log duplicate error stacktrace for every request in the batch

EDIT: After discussing with @joerunde, we changed the solution. In client.py, if the engine is errored, the server should shut down; to guarantee that, we make sure an MQEngineDeadError is sent regardless of the type of the original exception. Previously, the server only shut down if the exception was a RuntimeError, and in that case it already avoided logging the replicated stacktrace. With that, we removed the MQEngineBatchError exception, since it no longer makes sense, while still achieving the goal of this PR: the launcher already has an exception handler that does not replicate logs when it catches MQEngineDeadError.
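
For clarity, a minimal sketch of that idea, assuming a small client-side helper (the helper name and wrapping logic are illustrative, not the actual client.py code):

# Sketch only: wrap whatever exception killed the engine into an
# MQEngineDeadError so the launcher shuts the server down without
# re-logging the full stacktrace for every request.
class MQEngineDeadError(RuntimeError):
    pass


def to_engine_dead_error(exc: BaseException) -> MQEngineDeadError:
    if isinstance(exc, MQEngineDeadError):
        return exc
    return MQEngineDeadError(f"Engine loop is dead. Original error: {exc!r}")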


Currently, if there is an error in the engine while processing the current batch, the whole stacktrace ends up getting logged for every request in that batch. This PR addresses the issue to improve server log readability.

Steps to reproduce

I created a script that keeps sending requests to the server in parallel, to keep the engine busy batching multiple requests.

import json
from multiprocessing import Pool

import requests

prompts = ["how to make cheesecake", "How to make pizza", "Who is afraid of the big bad wolf?", "Who is the president of Brazil", "What is a capital of Spain", "Who is the president of USA"]


def do_generate(idx):
    # Keep one prompt in flight per worker so the engine always has a batch to process.
    while True:
        data = {
            "model": "ibm/merlinite-7b",
            "prompt": [prompts[idx]],
            "max_tokens": 200,
            "temperature": 0,
            "stream": False
        }

        res = requests.post("http://localhost:8000/v1/completions",
                            data=json.dumps(data),
                            headers={"Content-Type": "application/json"})

        print(json.dumps(res.json(), indent=2))


if __name__ == "__main__":
    # One worker per prompt (prompts are indexed 0..5).
    indices = list(range(len(prompts)))
    with Pool(len(indices)) as p:
        p.map(do_generate, indices)

On the vLLM side I added a hardcoded check to force an exception (not sure if there is a better way to do that): if the engine receives a request with max_tokens == 123, it raises an exception.

    @torch.inference_mode()
    @dump_input_when_exception(exclude_args=[0], exclude_kwargs=["self"])
    def execute_model(
        self,
        model_input: ModelInputForGPUWithSamplingMetadata,
        kv_caches: List[torch.Tensor],
        intermediate_tensors: Optional[IntermediateTensors] = None,
        num_steps: int = 1,
    ) -> Optional[Union[List[SamplerOutput], IntermediateTensors]]:
        # Hardcoded poison check, used only to reproduce the issue.
        for g in model_input.sampling_metadata.seq_groups:
            if g.sampling_params.max_tokens == 123:
                raise Exception("FORCED EXCEPTION")

Then just send a poisoned curl with max_tokens == 123 to make the server crash.

curl http://localhost:8000/v1/completions -H "Content-Type: application/json"   -d '{
    "model": "ibm/merlinite-7b",
    "prompt": ["How to make pizza"],
    "max_tokens": 123,
    "temperature": 0
  }'

Below is the server log for this scenario:

Server log
INFO 10-01 18:43:01 api_server.py:520] vLLM API server version 0.6.1.post2
INFO 10-01 18:43:01 api_server.py:521] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/shared_model_storage/transformers_cache/models--ibm--merlinite-7b/snapshots/233d12759d5bb9344231dafdb51310ec19d79c0e/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
[...]
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-c52dbd9a95114d1cbfa182e588138c86-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-c52dbd9a95114d1cbfa182e588138c86-0.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-cf480cd01b3f45f9a44ee90f910b6dc7-0: prompt: 'Who is afraid of the big bad wolf?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 8526, 302, 272, 2032, 2607, 24100, 28804], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-a81e48bee0294b0b84cb89766e3747b1-0: prompt: 'Who is the president of Brazil', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 13250], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-91fe755b3d0248ab94b2671eae5be62b-0: prompt: 'What is a capital of Spain', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1824, 349, 264, 5565, 302, 12567], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 logger.py:36] Received request cmpl-62a2bd68b2514c3997280fd68ae3bf38-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:27 metrics.py:351] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-cf480cd01b3f45f9a44ee90f910b6dc7-0.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-a81e48bee0294b0b84cb89766e3747b1-0.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-91fe755b3d0248ab94b2671eae5be62b-0.
INFO 10-01 18:43:27 engine.py:255] Added request cmpl-62a2bd68b2514c3997280fd68ae3bf38-0.
INFO:     ::1:51364 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:28 logger.py:36] Received request cmpl-c5044c83c77b430d9038ecd2b7d1089e-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:28 engine.py:255] Added request cmpl-c5044c83c77b430d9038ecd2b7d1089e-0.
INFO:     ::1:51372 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:28 logger.py:36] Received request cmpl-3af6a00b38db4987a36242d859e5cf93-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:28 engine.py:255] Added request cmpl-3af6a00b38db4987a36242d859e5cf93-0.
INFO:     ::1:51374 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:29 logger.py:36] Received request cmpl-c9c4172a7de346d7ac1d38b3c2ef8446-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:29 engine.py:255] Added request cmpl-c9c4172a7de346d7ac1d38b3c2ef8446-0.
INFO:     ::1:51380 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:29 logger.py:36] Received request cmpl-dbc219febf5f4375ae74c0eed37c30a0-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:29 engine.py:255] Added request cmpl-dbc219febf5f4375ae74c0eed37c30a0-0.
INFO:     ::1:51388 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:30 logger.py:36] Received request cmpl-97b0729b39ac464e98518e9773d42fff-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:30 engine.py:255] Added request cmpl-97b0729b39ac464e98518e9773d42fff-0.
INFO:     ::1:51402 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:30 logger.py:36] Received request cmpl-21dd2f2709064410bcd6c66eda41d1da-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:30 engine.py:255] Added request cmpl-21dd2f2709064410bcd6c66eda41d1da-0.
INFO:     ::1:51412 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:31 logger.py:36] Received request cmpl-6837e12c5c3c4ac5aaf6bbe816dd2a69-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:31 engine.py:255] Added request cmpl-6837e12c5c3c4ac5aaf6bbe816dd2a69-0.
INFO:     ::1:51426 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-01 18:43:31 logger.py:36] Received request cmpl-69faaae7ba6b48e58655c07a2e7ba046-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=123, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-01 18:43:31 engine.py:255] Added request cmpl-69faaae7ba6b48e58655c07a2e7ba046-0.
INFO 10-01 18:43:31 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241001-184331.pkl...
INFO 10-01 18:43:31 model_runner_base.py:141] Completed writing input of failed execution to /tmp/err_execute_model_input_20241001-184331.pkl.
INFO:     ::1:51348 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-01 18:43:31 engine.py:130] Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION')
ERROR 10-01 18:43:31 engine.py:130] Traceback (most recent call last):
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-01 18:43:31 engine.py:130]     return func(*args, **kwargs)
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner.py", line 1548, in execute_model
ERROR 10-01 18:43:31 engine.py:130]     raise Exception("FORCED EXCEPTION")
ERROR 10-01 18:43:31 engine.py:130] Exception: FORCED EXCEPTION
ERROR 10-01 18:43:31 engine.py:130] 
ERROR 10-01 18:43:31 engine.py:130] The above exception was the direct cause of the following exception:
ERROR 10-01 18:43:31 engine.py:130] 
ERROR 10-01 18:43:31 engine.py:130] Traceback (most recent call last):
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 128, in start
ERROR 10-01 18:43:31 engine.py:130]     self.run_engine_loop()
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 188, in run_engine_loop
ERROR 10-01 18:43:31 engine.py:130]     request_outputs = self.engine_step()
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 10-01 18:43:31 engine.py:130]     raise e
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 198, in engine_step
ERROR 10-01 18:43:31 engine.py:130]     return self.engine.step()
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/engine/llm_engine.py", line 1228, in step
ERROR 10-01 18:43:31 engine.py:130]     outputs = self.model_executor.execute_model(
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-01 18:43:31 engine.py:130]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-01 18:43:31 engine.py:130]     output = self.model_runner.execute_model(
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/site-packages/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-01 18:43:31 engine.py:130]     return func(*args, **kwargs)
ERROR 10-01 18:43:31 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 144, in _wrapper
ERROR 10-01 18:43:31 engine.py:130]     raise type(err)(
ERROR 10-01 18:43:31 engine.py:130] Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51352 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51356 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51354 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  [Previous line repeated 1 more time]
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
INFO:     ::1:51438 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/tmp/site-packages/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/tmp/site-packages/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/tmp/site-packages/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/tmp/vllm/vllm/entrypoints/openai/api_server.py", line 327, in create_completion
    generator = await completion(raw_request).create_completion(
  File "/tmp/vllm/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
    async for i, res in result_generator:
  File "/tmp/vllm/vllm/utils.py", line 488, in merge_async_iterators
    item = await d
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  File "/tmp/vllm/vllm/engine/multiprocessing/client.py", line 492, in _process_request
    raise request_output
  [Previous line repeated 2 more times]
Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241001-184331.pkl): FORCED EXCEPTION
CRITICAL 10-01 18:43:31 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     ::1:51454 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [62660]

Solution

I created a new custom exception, MQEngineBatchError, which is raised when the engine hits an error while processing a batch. When this exception propagates up to the HTTP server, before uvicorn/fastapi tries to log it and pollute the log, a custom exception handler prints only a single line to report that the batch this request belonged to failed. There is still one line per request, because we have to give at least some feedback on why each request failed, but the full stacktrace is only logged once, as we can see in this snippet:

#vllm/engine/multiprocessing/engine.py
def start(self):
    try:
        try:
            logger.debug("Starting Startup Loop.")
            self.run_startup_loop()
            logger.debug("Starting heartbeat thread")
            self.heartbeat_thread.start()
            logger.debug("Starting Engine Loop.")
            self.run_engine_loop()
        except Exception as e:
            logger.exception(repr(e)) # HERE: Log the exception for the first time
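
For reference, a minimal sketch of the kind of launcher-side handler described above, using FastAPI's exception_handler hook; the function names and response body here are illustrative assumptions, not the actual vLLM launcher code:

# Sketch only: log one line per failed request instead of the full stacktrace,
# which was already logged once by the engine process.
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logger = logging.getLogger("vllm.entrypoints.launcher")


class MQEngineBatchError(RuntimeError):
    """Raised for every request whose batch failed in the engine."""


def register_batch_error_handler(app: FastAPI) -> None:
    @app.exception_handler(MQEngineBatchError)
    async def handle_batch_error(request: Request, exc: MQEngineBatchError):
        logger.error(repr(exc))  # single line, no traceback
        return JSONResponse(
            status_code=500,
            content={"error": "engine failed while processing the batch"})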

With this change, the server log looks like this:

New Server log
INFO 10-02 13:00:19 api_server.py:520] vLLM API server version 0.6.1.post2
INFO 10-02 13:00:19 api_server.py:521] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/shared_model_storage/transformers_cache/models--ibm--merlinite-7b/snapshots/233d12759d5bb9344231dafdb51310ec19d79c0e/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
[...]
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-2cda52baff774a22ae96289bfe6633cf-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-2cda52baff774a22ae96289bfe6633cf-0.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-3ab8b1fe7d3946de8143ce0ae9ce9652-0: prompt: 'Who is afraid of the big bad wolf?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 8526, 302, 272, 2032, 2607, 24100, 28804], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-71db974fad6f4d9ebd9b449c17b3c033-0: prompt: 'What is a capital of Spain', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1824, 349, 264, 5565, 302, 12567], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-01665a5bcabd456e967097e5ad5572b2-0: prompt: 'Who is the president of Brazil', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 13250], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 logger.py:36] Received request cmpl-8d0708b726264b05be87395de88f1fc6-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:42 metrics.py:351] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-3ab8b1fe7d3946de8143ce0ae9ce9652-0.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-71db974fad6f4d9ebd9b449c17b3c033-0.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-01665a5bcabd456e967097e5ad5572b2-0.
INFO 10-02 13:00:42 engine.py:255] Added request cmpl-8d0708b726264b05be87395de88f1fc6-0.
INFO:     ::1:36138 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:43 logger.py:36] Received request cmpl-f49daa3545254a1080057dee3495b058-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:43 engine.py:255] Added request cmpl-f49daa3545254a1080057dee3495b058-0.
INFO:     ::1:41382 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:44 logger.py:36] Received request cmpl-4263620a4dad440ba4b6c5657557390d-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:44 engine.py:255] Added request cmpl-4263620a4dad440ba4b6c5657557390d-0.
INFO:     ::1:41390 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:44 logger.py:36] Received request cmpl-0d82c4a335be43eaa03d739410192c22-0: prompt: 'Who is the president of us', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [6526, 349, 272, 4951, 302, 592], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:44 engine.py:255] Added request cmpl-0d82c4a335be43eaa03d739410192c22-0.
INFO:     ::1:41392 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 10-02 13:00:45 logger.py:36] Received request cmpl-9740702ee25c4380861024255ea8b5f6-0: prompt: 'How to make pizza', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=123, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1602, 298, 1038, 20727], lora_request: None, prompt_adapter_request: None.
INFO 10-02 13:00:45 engine.py:255] Added request cmpl-9740702ee25c4380861024255ea8b5f6-0.
INFO 10-02 13:00:45 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241002-130045.pkl...
INFO 10-02 13:00:45 model_runner_base.py:141] Completed writing input of failed execution to /tmp/err_execute_model_input_20241002-130045.pkl.
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36130 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36132 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 engine.py:130] Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')
ERROR 10-02 13:00:45 engine.py:130] Traceback (most recent call last):
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-02 13:00:45 engine.py:130]     return func(*args, **kwargs)
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner.py", line 1549, in execute_model
ERROR 10-02 13:00:45 engine.py:130]     raise Exception("FORCED EXCEPTION")
ERROR 10-02 13:00:45 engine.py:130] Exception: FORCED EXCEPTION
ERROR 10-02 13:00:45 engine.py:130] 
ERROR 10-02 13:00:45 engine.py:130] The above exception was the direct cause of the following exception:
ERROR 10-02 13:00:45 engine.py:130] 
ERROR 10-02 13:00:45 engine.py:130] Traceback (most recent call last):
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 128, in start
ERROR 10-02 13:00:45 engine.py:130]     self.run_engine_loop()
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 188, in run_engine_loop
ERROR 10-02 13:00:45 engine.py:130]     request_outputs = self.engine_step()
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 10-02 13:00:45 engine.py:130]     raise e
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/multiprocessing/engine.py", line 198, in engine_step
ERROR 10-02 13:00:45 engine.py:130]     return self.engine.step()
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/engine/llm_engine.py", line 1228, in step
ERROR 10-02 13:00:45 engine.py:130]     outputs = self.model_executor.execute_model(
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-02 13:00:45 engine.py:130]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-02 13:00:45 engine.py:130]     output = self.model_runner.execute_model(
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/env2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-02 13:00:45 engine.py:130]     return func(*args, **kwargs)
ERROR 10-02 13:00:45 engine.py:130]   File "/tmp/vllm/vllm/worker/model_runner_base.py", line 144, in _wrapper
ERROR 10-02 13:00:45 engine.py:130]     raise type(err)(
ERROR 10-02 13:00:45 engine.py:130] Exception: Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36152 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:36136 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")
INFO:     ::1:41404 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 10-02 13:00:45 launcher.py:100] MQLLMEngine is already dead, terminating server process
INFO:     ::1:41408 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [64354]


github-actions bot commented Oct 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@wallashss wallashss changed the title from "[Frontend] Added MQEngineBatchError to improve stacktrace readability" to "[Frontend] Don't log duplicate error stacktrace for every request in the batch" on Oct 2, 2024
@wallashss wallashss force-pushed the dont_duplicate_err branch 3 times, most recently from da442ba to 0b18f78 on October 3, 2024 at 13:23
@@ -164,7 +164,7 @@ async def test_failed_abort(tmp_socket):
                 sampling_params=SamplingParams(max_tokens=10),
                 request_id=uuid.uuid4()):
             pass
-        assert "KeyError" in repr(execinfo.value)
+        assert "MQEngineDeadError" in repr(execinfo.value)
Collaborator

I think this change might break the logic for this test?

IIUC, what was being done previously was that a KeyError was being raised in the engine, which then caused an MQEngineDeadError. The test then checks that the original KeyError is still referenced in the raised MQEngineDeadError.

I think it's still important for the original error to be surfaced once, is this change here intentional as part of not repeating the stack trace a bunch of times?

Contributor Author

Oh wait... that is not what I meant either. My intention was to use an MQEngineBatchError. I did a quick local fix, and it worked as I expected.

I think it's still important for the original error to be surfaced once, is this change here intentional as part of not repeating the stack trace a bunch of times?

Totally agree. But thinking about the system as a whole, this error is already logged at least once in this snippet:

#repo/vllm/engine/multiprocessing/engine.py
    def start(self):
        try:
            try:
                logger.debug("Starting Startup Loop.")
                self.run_startup_loop()
                logger.debug("Starting heartbeat thread")
                self.heartbeat_thread.start()
                logger.debug("Starting Engine Loop.")
                self.run_engine_loop()
            except Exception as e:
                logger.exception(repr(e))

is this change here intentional as part of not repeating the stack trace a bunch of times?

The challenge here is on the client side: previously, when there was an error in the batch, all requests received the same exception, which was then propagated to the server layer where it got logged many times. Therefore, your request might receive an exception reporting a KeyError that has nothing to do with it, because it was actually raised by another request. That's why, for now, I think it makes sense to change this test.

Collaborator

Ah, yeah I gotcha. We don't want to log the full stack trace here for the original RAISED_ERROR since it was already logged when that error actually happened. However, looking at the example server logs you posted, it does look like we are at least keeping the string repr of the original exception on the propagated exception here:

ERROR 10-02 13:00:45 launcher.py:111] MQEngineBatchError("A batch generation failed. Inspect the stacktrace to find the original error: Exception('Error in model execution (input dumped to /tmp/err_execute_model_input_20241002-130045.pkl): FORCED EXCEPTION')")

(the FORCED EXCEPTION is there from your original exception)

I probably would not change the RAISED_ERROR here to be an MQEngineBatchError, because this is the error that we are making the underlying LLMEngine raise, and it will never do that. I think instead we should keep a different error here (KeyError is probably still fine) and check that its message makes it through.

This should work, right?

RAISED_ERROR = KeyError("foo")

...

assert "foo" in repr(execinfo.value)

Contributor Author

Just updated.

I went back to the original code:

RAISED_ERROR = KeyError

I had changed it before because there are some lines where we have:

with pytest.raises(RAISED_ERROR):

After carefully reading the tests again, I actually had to change only a few lines (in other places) to preserve the original tests. I think it makes more sense now; thank you for helping me realize that!

This should work, right?

RAISED_ERROR = KeyError("foo")

...

assert "foo" in repr(execinfo.value)

Almost. I printed the output of this test and got:

Engine loop is not running. Inspect the stacktrace to find the original error: KeyError().

In this test this KeyError is raised from a request that does not exist:

# Trigger an abort on the client side.
# This request ID does not exist, and will cause the engine to error
await client.abort(request_id="foo")

Not sure why I had to change it before. I reverted it and the tests run fine.

"stacktrace to find the original error: "
f"{repr(exception)}")
# If it is a runtime exception, we assume that
# the engine is already dead, let's pass this
Collaborator

Ah, I think maybe I see: we're trying to send MQEngineDeadError instead of MQEngineBatchError if the engine is already dead... 🤔

I'm not sure that checking for RuntimeError is the most robust approach; we should be able to check self.errored instead.
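
A minimal sketch of that suggestion (illustrative only; the helper name is hypothetical, it assumes the engine exposes an errored property, and it reuses the error types discussed in this thread rather than the exact vLLM code):

def _exception_for_batch(self, exception: BaseException):
    # Hypothetical helper, for illustration only.
    msg = ("A batch generation failed. Inspect the stacktrace to find "
           f"the original error: {repr(exception)}")
    if self.errored:
        # Engine is already dead: send an engine-dead error so the
        # server shuts down instead of serving doomed requests.
        return MQEngineDeadError(msg)
    # Engine survived the batch failure: report a batch-level error.
    return MQEngineBatchError(msg)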

Contributor Author

Yeah... I broke a test before because of this RuntimeError. There is already an exception handler that expects a RuntimeError and will shut down the server next (in this case it already keeps the behavior we expect in this PR: no duplicated error stacktrace).

# repo/vllm/entrypoints/launcher.py
@app.exception_handler(RuntimeError)
async def runtime_error_handler(request: Request, __):
    """On generic runtime error, check to see if the engine has died.
    It probably has, in which case the server will no longer be able to
    handle requests. Trigger a graceful shutdown with a SIGTERM."""
    engine = request.app.state.engine_client
    if (not envs.VLLM_KEEP_ALIVE_ON_ENGINE_DEATH and engine.errored
            and not engine.is_running):
        logger.fatal("AsyncLLMEngine has failed, terminating server "
                     "process")
        # See discussions here on shutting down a uvicorn server
        # https://github.com/encode/uvicorn/discussions/1103
        # In this case we cannot await the server shutdown here because
        # this handler must first return to close the connection for
        # this request.
        server.should_exit = True

So, one of my contributions in this PR was to add a flow that handles Exceptions that are not RuntimeErrors, which otherwise cause the stacktrace to be logged multiple times.

Collaborator

So, the intent behind the extra RuntimeError there was to try to catch anything that might have killed the engine, but did not raise an EngineDeadError, which ideally should not happen. It's kinda equivalent to

try:
    handle_request(req)
except EngineDeadError:
    engine_dead_handler(req)
except RuntimeError:
    runtime_error_handler(req)

Maybe that should actually be changed to @app.exception_handler(Exception) to catch everything? Would that solve this problem entirely by just doing that? I had originally written these handlers, and I think I may have just made a bad assumption that unexpected things would be RuntimeErrors, which is probably often true but not always true. (It's not true for your test case where you explicitly raised Exception 😄 )

Then regardless of the server handling, I do think the error handling here in the engine should be more robust than just checking for RuntimeError. If we want special logic based on whether the engine is already dead, then I think we can do:

batch_error = MQEngineDeadError(msg) if self.errored else MQEngineBatchError(msg)

Contributor Author

Maybe that should actually be changed to @app.exception_handler(Exception) to catch everything?

Sort of. I added a custom handler for MQEngineBatchError because in that case we are sure that the engine had an error and it was already logged. But what if we get an unhandled exception not related to the engine? I think in that case we should let the exception go up to the server and let FastAPI/Uvicorn do their thing.

batch_error = MQEngineDeadError(msg) if self.errored else MQEngineBatchError(msg)

Today, in practice, any exception will eventually raise an MQEngineDeadError.

    # Set to error state only on engine critical error
    # (and record only the first one)
    if is_engine_errored and not self._errored_with:
        self._errored_with = exception

Honestly, I don't know the best solution; I wrote mine based on my reading of the code and tried to keep the old behavior. I found two scenarios: i) a runtime exception shuts down the server immediately; ii) other exceptions set the engine to errored, and when another request arrives it throws an MQEngineDeadError and the server shuts down next. A simplification would be to treat everything as a runtime error...

I introduced MQEngineBatchError thinking that in the future vLLM would be more robust to those errors, i.e. raise an error and keep running, but maybe it is too soon for that feature. I would like to hear more opinions on that.
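
For reference, the dead-engine propagation described above looks roughly like this on the client side (a simplified sketch pieced together from the snippets and log messages in this thread; the _raise_if_dead helper name is hypothetical, not the exact client.py code):

    @property
    def errored(self) -> bool:
        # The client is considered errored once any engine failure
        # has been recorded in _errored_with.
        return self._errored_with is not None

    def _raise_if_dead(self):
        # Illustrates how later requests end up with an MQEngineDeadError
        # that wraps the first recorded exception.
        if self.errored:
            raise MQEngineDeadError(
                "Engine loop is not running. Inspect the stacktrace "
                f"to find the original error: {repr(self._errored_with)}.")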

Collaborator

@joerunde joerunde Oct 9, 2024

Thanks for all the clarification @wallashss! This is definitely a bit confusing to hold it all in my head 😅

But, what if we got an unhandled exception not related to the engine? I think in those case let the exception go up to the server and let FastAPI/Uvicorn do their thing.

That's a good distinction, yeah. I'm okay with leaving stack traces in for things that we don't think were caused by an exception that killed the engine.

ii) other exceptions, set the engine erroed, when another request arrive then throws MQEngineDeadError and shutdown the server next. A simplification would be treat everything as runtime errors...

Oh, so actually I don't think we want to wait for the second request to come in to throw MQEngineDeadError; we should probably throw it immediately so that the server shuts down ASAP once it can no longer process requests.

I introduced MQEngineBatchError thinking that in the future vLLM would be more robust to those errors, raise an error and keep up and running, but maybe it is a too soon feature

Yeah, it's a nice touch but I don't think we currently have the robustness to use it, like you say any exception will kill the engine :(

This is super close, I think I just have these two comments left then:

  1. I'd rather not include the new error type MQEngineBatchError since we don't yet have a case where an exception is raised but the engine stays alive. (My care amount on this is low)
  2. I want to make sure we're always killing the engine and server ASAP once it errors. (My care amount on this is high). I think this is already mostly covered by your code, we'll just want to make sure that we don't run into the case you mentioned about raising one error first and then waiting for the next request to raise an MQEngineDeadError. Maybe a catch-all for this could be like
@app.exception_handler(Exception)
async def handler(r, e):
    if engine.errored:
        # log just the exception message
        # shut down server
    else:
        # log full stack trace

But I think that's very ham-fisted and probably not necessary 😉
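
For concreteness, a fleshed-out version of that catch-all might look like the sketch below (modeled on the runtime_error_handler excerpt above; it assumes the same app, logger, and server objects as that excerpt plus FastAPI's JSONResponse, and the handler name and JSON body are illustrative, not the PR's actual code):

from fastapi import Request
from fastapi.responses import JSONResponse

@app.exception_handler(Exception)
async def unhandled_exception_handler(request: Request, exc: Exception):
    engine = request.app.state.engine_client
    if engine.errored and not engine.is_running:
        # The engine already logged the root cause when it died, so log
        # only the message here and trigger a graceful shutdown.
        logger.fatal("Engine has failed, terminating server process")
        server.should_exit = True
    else:
        # Failure unrelated to engine death: keep the full stack trace.
        logger.exception(repr(exc))
    return JSONResponse(status_code=500, content={"error": repr(exc)})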

@wallashss
Contributor Author

Thanks for the review @joerunde! See my comments and check whether they make sense. I am totally open to reviewing ideas and changing the implementation.

wallashss added a commit to wallashss/vllm that referenced this pull request Oct 8, 2024
wallashss added a commit to wallashss/vllm that referenced this pull request Oct 8, 2024
# exception for a batch, and we may not know the
# request that caused it, nor whether it was actually
# caused by any of them (e.g. CUDA OOM). Therefore we
# broadcast the same exception to all requests.
Collaborator

Nice, love the explanation here!

assert client.errored

# Engine is errored, should get ENGINE_DEAD_ERROR.
# Throws an error that should get ENGINE_DEAD_ERROR.
Collaborator

We could also check a "batch" of requests here, like

def do_generate(client):
    async for _ in client.generate(prompt="Hello my name is",
                                   sampling_params=SamplingParams(),
                                   request_id=uuid.uuid4()):
        pass

...
# (in this test)
tasks = [asyncio.create_task(do_generate(client)) for _ in range(10)]

# Check that every `task` in `tasks` failed with `MQEngineDeadError`

That should test that we don't get the big spew of stack traces, since every request will raise an error type that doesn't log the stack trace

# should get the same exception as a MQEngineDeadError.
errors = await asyncio.gather(*tasks, return_exceptions=True)
for e in errors:
    assert "KeyError" in repr(e)
Collaborator

@wallashss I think we need to assert that these errors are also MQEngineDeadErrors here, then we're good to go
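
Something like the following would cover that (a sketch; it assumes MQEngineDeadError is importable from vllm.engine.multiprocessing and reuses the tasks from the snippet above):

errors = await asyncio.gather(*tasks, return_exceptions=True)
for e in errors:
    # Every request in the batch should fail with the engine-dead error,
    # and that error should still reference the original KeyError.
    assert isinstance(e, MQEngineDeadError)
    assert "KeyError" in repr(e)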

Member

@njhill njhill left a comment

@njhill njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 17, 2024
@njhill njhill merged commit 711f3a7 into vllm-project:main Oct 21, 2024
59 checks passed
charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
MErkinSag pushed a commit to MErkinSag/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
…the batch (vllm-project#9023)

Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Labels
ready ONLY add when PR is ready to merge/full CI is needed
3 participants