
Why does vllm consume so much memory? #795

Open
HouLingLXH opened this issue Jan 23, 2025 · 4 comments

@HouLingLXH

HouLingLXH commented Jan 23, 2025

Jetson Orin NX 16G
jetpack 6.2
docker: dustynv/vllm:0.6.3-r36.4.0

llm = LLM(model="./Qwen2-7B-Instruct.Q4_K_M.gguf", max_model_len=200, max_num_seqs=1)

It consumes more than 16 GB of memory and then gets killed:
INFO 01-23 07:32:01 model_runner.py:1060] Starting to load model ./Qwen2-7B-Instruct.Q4_K_M.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-23 07:32:42 model_runner.py:1071] Loading model weights took 4.3752 GB
INFO 01-23 07:34:26 gpu_executor.py:122] # GPU blocks: 11836, # CPU blocks: 4681
INFO 01-23 07:34:26 gpu_executor.py:126] Maximum concurrency for 200 tokens per request: 946.88x
Killed


With ollama, I can run qwen2.5-7b on the same device.

@johnnynunez
Collaborator

Because each framework behaves differently in how it processes the model.
Execution requirements: the vLLM framework uses additional memory to handle token generation, input/output management, and other tasks. Also, with vLLM it is all up to you: memory budget, input token length, etc.

Also, you are using an old version of vLLM. @dusty-nv this Docker image should also be updated, because since 0.6.3.post1 we have vLLM natively on Jetson and it includes optimizations for unified memory.
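
To make that concrete: vLLM preallocates its KV cache up front (the "# GPU blocks: 11836" line above corresponds to 11836 × 16 = 189,376 cached tokens at vLLM's default block size of 16, which is where the 946.88x concurrency figure comes from), and by default it tries to claim 90% of visible GPU memory. A minimal sketch of the parameters that bound this, with illustrative values rather than numbers tuned for the Orin NX:

```python
from vllm import LLM

# Illustrative values only; the right numbers depend on the board and model.
llm = LLM(
    model="./Qwen2-7B-Instruct.Q4_K_M.gguf",
    max_model_len=200,           # shorter context -> fewer KV-cache blocks per sequence
    max_num_seqs=1,              # limit concurrent sequences
    gpu_memory_utilization=0.6,  # fraction of device memory vLLM may claim (default 0.9)
)
```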

@HouLingLXH
Author

HouLingLXH commented Jan 23, 2025

Is 16G enough to run Qwen2-7B-Instruct.Q4_K_M.gguf on a Jetson Orin NX?

It cannot run in dustynv/vllm:0.6.6.post1-r36.4.0 either.

@dusty-nv
Owner

dusty-nv commented Jan 23, 2025 via email

@leon-seidel
Contributor

leon-seidel commented Jan 28, 2025

I would recommend adding --swap_space 0 to the command, otherwise vLLM will allocate another 4 GB of unified memory. Adding --enforce-eager helps with some models, too. You might also get further speedups by using models quantized with llm-compressor, which should be more optimized than GGUFs.
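
Applied to the snippet from the original post, those suggestions would look roughly like this (a sketch; swap_space and enforce_eager are the LLM() keyword equivalents of the CLI flags mentioned above):

```python
from vllm import LLM

llm = LLM(
    model="./Qwen2-7B-Instruct.Q4_K_M.gguf",
    max_model_len=200,
    max_num_seqs=1,
    swap_space=0,        # skip the default 4 GiB CPU swap-space allocation
    enforce_eager=True,  # disable CUDA graph capture to save memory
)
```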
