Hello, I am currently hosting a Docker image with vLLM 0.6.4.post1 in a Kubernetes pod. When attempting to make a completions request, I receive the following trace:
```
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: INFO 11-16 15:18:47 logger.py:37] Received request cmpl-ad9b68a083ad4bb09522daf6d65744c0-0: prompt: 'Hello, this is a test.', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=20, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [9906, 11, 420, 374, 264, 1296, 13], lora_request: None, prompt_adapter_request: None.
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: INFO 11-16 15:18:47 engine.py:267] Added request cmpl-ad9b68a083ad4bb09522daf6d65744c0-0.
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: INFO 11-16 15:18:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241116-151847.pkl...
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: INFO 11-16 15:18:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241116-151847.pkl.
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: CRITICAL 11-16 15:18:47 launcher.py:99] MQLLMEngine is already dead, terminating server process
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: INFO: 127.0.0.1:59556 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 :: ERROR 11-16 15:18:47 engine.py:135] TypeError("CompilationError.__init__() missing 1 required positional argument: 'node'")

Traceback (most recent call last):
  File "/miniforge3/lib/python3.10/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/triton/language/core.py", line 1597, in load
    return semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
  File "/miniforge3/lib/python3.10/site-packages/triton/language/semantic.py", line 1037, in load
    return _load_legacy(ptr, mask, other, boundary_check, padding, cache, eviction, is_volatile, builder)
  File "/miniforge3/lib/python3.10/site-packages/triton/language/semantic.py", line 1005, in _load_legacy
    other = cast(other, elt_ty, builder)
  File "/miniforge3/lib/python3.10/site-packages/triton/language/semantic.py", line 759, in cast
    assert builder.options.allow_fp8e4nv, "fp8e4nv data type is not supported on CUDA arch < 89"
AssertionError: fp8e4nv data type is not supported on CUDA arch < 89

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/miniforge3/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 553, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/miniforge3/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 143, in __call__
    return self.forward(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 340, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 259, in forward
    hidden_states = self.self_attn(positions=positions,
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 189, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/attention/layer.py", line 99, in forward
    return self.impl.forward(query,
  File "/miniforge3/lib/python3.10/site-packages/vllm/attention/backends/xformers.py", line 566, in forward
    out = PagedAttention.forward_prefix(
  File "/miniforge3/lib/python3.10/site-packages/vllm/attention/ops/paged_attn.py", line 211, in forward_prefix
    context_attention_fwd(
  File "/miniforge3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/attention/ops/prefix_prefill.py", line 811, in context_attention_fwd
    _fwd_kernel[grid](
  File "/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 662, in run
    kernel = self.compile(
  File "/miniforge3/lib/python3.10/site-packages/triton/compiler/compiler.py", line 276, in compile
    module = src.make_ir(options, codegen_fns, context)
  File "/miniforge3/lib/python3.10/site-packages/triton/compiler/compiler.py", line 113, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
triton.compiler.errors.CompilationError: at 110:17:
                     cur_kv_head * stride_k_cache_h +
                     (offs_d[:, None] // x) * stride_k_cache_d +
                     ((start_n + offs_n[None, :]) % block_size) *
                     stride_k_cache_bl +
                     (offs_d[:, None] % x) * stride_k_cache_x)
            # [N,D]
            off_v = (
                bn[:, None] * stride_v_cache_bs +
                cur_kv_head * stride_v_cache_h +
                offs_d[None, :] * stride_v_cache_d +
                (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
            k_load = tl.load(K_cache + off_k,
                 ^

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniforge3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
    self.run_engine_loop()
  File "/miniforge3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
    request_outputs = self.engine_step()
  File "/miniforge3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
    raise e
  File "/miniforge3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
    return self.engine.step()
  File "/miniforge3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1454, in step
    outputs = self.model_executor.execute_model(
  File "/miniforge3/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 125, in execute_model
    output = self.driver_worker.execute_model(execute_model_req)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1329]
  File "/miniforge3/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 343, in execute_model
    output = self.model_runner.execute_model(
  File "/miniforge3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/miniforge3/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
TypeError: CompilationError.__init__() missing 1 required positional argument: 'node'

Download model after launching VLM server.
Default Model Args are: --max-model-len 128000 --quantization marlin --gpu-memory-utilization 0.97 --trust-remote-code --enforce-eager --kv-cache-dtype fp8
Number of GPU's consumed: 1
NVIDIA visible device string: 1
CUDA visible device string:
Starting vllm api server...
Model: hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4
LLM_LOG_LEVEL is set to
Running checks to wait for Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 server to start...
Could not resolve Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 server. Pausing 10s.
```
Curiously, this does not happen on our current production version, 0.6.0.
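For what it's worth, the launch args include `--kv-cache-dtype fp8`, and the innermost assertion is `fp8e4nv data type is not supported on CUDA arch < 89`, so my working assumption is that the 0.6.4 prefix-prefill Triton kernel emits fp8e4nv loads that require compute capability 8.9 or newer. A minimal sketch of the check I ran inside the pod (assuming the `CUDA arch` in the assertion corresponds to what `torch.cuda.get_device_capability` reports):

```python
# Hypothetical diagnostic run inside the pod: report GPU 0's compute
# capability. Assumption on my part: "CUDA arch < 89" in the Triton
# assertion means compute capability below 8.9, so fp8 KV-cache kernels
# cannot compile on this GPU.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if (major, minor) < (8, 9):
    print("fp8e4nv is unsupported on this GPU; "
          "--kv-cache-dtype fp8 would trip the assertion in the trace.")
```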
Here is the code invoking it:
```python
import requests
import json
import sys


def test_vllm_pod(host="localhost", port=1453):
    """Basic test for vLLM pod connectivity."""
    url = f"http://{host}:{port}/v1/completions"

    # Minimal payload
    payload = {
        "prompt": "Say hello:",
        "max_tokens": 10,
        "temperature": 0,
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4"
    }

    print(f"Testing vLLM pod at {url}")
    print(f"Request payload:\n{json.dumps(payload, indent=2)}")

    try:
        response = requests.post(
            url,
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=30  # Increased timeout
        )
        print(f"\nResponse Status: {response.status_code}")
        print("Response Headers:")
        for k, v in response.headers.items():
            print(f"{k}: {v}")
        if response.status_code == 200:
            try:
                data = response.json()
                print(f"\nResponse Data:\n{json.dumps(data, indent=2)}")
            except json.JSONDecodeError:
                print(f"\nRaw Response Text:\n{response.text}")
        else:
            print(f"\nError Response:\n{response.text}")
    except requests.exceptions.ConnectionError:
        print("Connection failed - verify port-forward to pod is active")
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=1453)
    args = parser.parse_args()
    test_vllm_pod(args.host, args.port)
```
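For context, I run this through an active port-forward to the pod (e.g. `kubectl port-forward <pod-name> 1453:1453`, pod name omitted here) and then invoke the script with `--host localhost --port 1453`; the POST above is what returns the 500 Internal Server Error and the trace.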