
Why does vllm consume so much memory? #795

Open
HouLingLXH opened this issue Jan 23, 2025 · 4 comments

@HouLingLXH

HouLingLXH commented Jan 23, 2025

Jetson Orin NX 16G
jetpack 6.2
docker: dustynv/vllm:0.6.3-r36.4.0

llm = LLM(model="./Qwen2-7B-Instruct.Q4_K_M.gguf", max_model_len=200, max_num_seqs=1)

It consumes more than 16 GB of memory and then gets killed:
INFO 01-23 07:32:01 model_runner.py:1060] Starting to load model ./Qwen2-7B-Instruct.Q4_K_M.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-23 07:32:42 model_runner.py:1071] Loading model weights took 4.3752 GB
INFO 01-23 07:34:26 gpu_executor.py:122] # GPU blocks: 11836, # CPU blocks: 4681
INFO 01-23 07:34:26 gpu_executor.py:126] Maximum concurrency for 200 tokens per request: 946.88x
Killed


With ollama, I can run qwen2.5-7b on the same device.

@johnnynunez
Collaborator

Because each framework behaves differently in how it processes the model.
Execution requirements: the vLLM framework uses additional memory to handle token generation, input/output management, and other tasks. Also, with vLLM it is all up to you: memory budget, input token length, etc.

Also, you are using an old version of vLLM. @dusty-nv this Docker image should also be updated, because since 0.6.3.post1 we have vLLM natively on Jetson and it includes optimizations for unified memory.
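
To make that concrete: vLLM preallocates its KV cache up front (the "# GPU blocks: 11836" line above corresponds to 11836 × 16 = 189,376 cached tokens at vLLM's default block size of 16, which is where the 946.88x concurrency figure comes from), and by default it tries to claim 90% of visible GPU memory. A minimal sketch of the parameters that bound this, with illustrative values rather than numbers tuned for the Orin NX:

```python
from vllm import LLM

# Illustrative values only; the right numbers depend on the board and model.
llm = LLM(
    model="./Qwen2-7B-Instruct.Q4_K_M.gguf",
    max_model_len=200,           # shorter context -> fewer KV-cache blocks per sequence
    max_num_seqs=1,              # limit concurrent sequences
    gpu_memory_utilization=0.6,  # fraction of device memory vLLM may claim (default 0.9)
)
```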

@HouLingLXH
Author

HouLingLXH commented Jan 23, 2025

Is 16G enough to run Qwen2-7B-Instruct.Q4_K_M.gguf on a Jetson Orin NX?

It cannot run in dustynv/vllm:0.6.6.post1-r36.4.0 either.

@dusty-nv
Owner

dusty-nv commented Jan 23, 2025 via email

@leon-seidel
Contributor

leon-seidel commented Jan 28, 2025

I would recommend adding --swap_space 0 to the command, otherwise vLLM will allocate another 4 GB of unified memory. Adding --enforce-eager helps with some models, too. You might also get further speedups by using models quantized with llm-compressor, which should be more optimized than GGUFs.
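
Applied to the snippet from the original post, those suggestions would look roughly like this (a sketch; swap_space and enforce_eager are the LLM() keyword equivalents of the CLI flags mentioned above):

```python
from vllm import LLM

llm = LLM(
    model="./Qwen2-7B-Instruct.Q4_K_M.gguf",
    max_model_len=200,
    max_num_seqs=1,
    swap_space=0,        # skip the default 4 GiB CPU swap-space allocation
    enforce_eager=True,  # disable CUDA graph capture to save memory
)
```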
