Replies: 2 comments
- good looking call
- it's a hard check, you can specify …
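The reply above is cut off, but assuming it refers to vLLM's `max_model_len` engine argument (the startup check verifies that the preallocated KV cache can hold at least one sequence of that length), a minimal sketch of capping the context so the model fits on a 24 GB card might look like this — the model id and the 32k value are illustrative assumptions, not recommendations:

```python
from vllm import LLM, SamplingParams

# Assumption: cap max_model_len below the model's default 128k so the
# worst-case KV-cache reservation fits on a 24 GB GPU. 32768 is illustrative.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",   # assumed model id
    max_model_len=32768,                    # cap context length instead of the full 128k
    gpu_memory_utilization=0.90,            # fraction of GPU memory vLLM may claim
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```

Raising `gpu_memory_utilization` can also help at the margin, but with a 128k default context the usual fix is lowering `max_model_len`.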
I thought vLLM's dynamic memory management would let the GPU hold as much KV cache as it could, i.e. grow the KV cache one page at a time as it fills up while constantly freeing the memory of finished queries and reallocating it.
However, when putting Llama-3.1-8B on a 24 GB 4090, it errors out saying there is not enough space available for the FULL 128k context length.
I am effectively running the vLLM offline inference example, just with Llama-3.1-8B (https://docs.vllm.ai/en/v0.5.5/getting_started/examples/offline_inference.html).
What am I misunderstanding here?
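For context, the setup described above amounts to roughly the following sketch — the linked offline-inference example with the model swapped to Llama-3.1-8B; the prompts, sampling settings, and exact model id are assumptions rather than the actual script:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# With no max_model_len override, vLLM keeps the model's default 128k context.
# The KV-cache pool is preallocated at startup (paging only manages blocks
# inside that fixed pool), and the engine checks that the pool can hold at
# least one full-length sequence. On a 24 GB 4090, next to the 8B weights,
# it cannot, so the engine errors out at startup rather than growing GPU
# memory on demand at runtime.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```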