Why does vllm consume so much memory? #795
Because each framework has different behavior in how it processes the model.
Execution requirements: the vLLM framework uses additional memory to handle token generation, input/output management, and other tasks. Also, with vLLM it is all up to you: memory size, input token length, etc.
Also, you are using an old version of vLLM. @dusty-nv This Docker image also needs to be updated, because with 0.6.3.post1 we have vLLM natively on Jetson, and it has optimizations for unified memory.
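For reference, the limits that comment alludes to are constructor arguments on vLLM's offline LLM API; a minimal sketch with illustrative values (the model id is a placeholder, not taken from this thread):

```python
from vllm import LLM, SamplingParams

# Illustrative values only; each of these shrinks vLLM's memory plan.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder; any supported model id or path
    max_model_len=2048,              # cap on prompt + generated tokens per sequence
    max_num_seqs=1,                  # sequences batched concurrently
    gpu_memory_utilization=0.5,      # fraction of GPU memory vLLM may claim (default 0.9)
)

out = llm.generate("Hello", SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```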
I will be rebuilding vLLM, yes. And often in "server mode", the default setting may be to pre-allocate more memory than necessary.
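Concretely, vLLM claims a fixed fraction of GPU memory up front (gpu_memory_utilization, 0.9 by default) and fills whatever the weights don't use with pre-allocated KV cache, regardless of how much the workload actually needs. On Jetson that fraction comes out of the same unified memory the OS is running in. A simplified sketch of the budget, not vLLM's actual code:

```python
# Simplified model of vLLM's up-front memory budget (not its actual code).
total_gib = 16.0      # Orin NX: GPU memory is unified with system RAM
utilization = 0.9     # vLLM's default gpu_memory_utilization
weights_gib = 4.38    # "Loading model weights took 4.3752 GB" in the log below

budget_gib = total_gib * utilization        # ~14.4 GiB claimed up front
kv_reserved_gib = budget_gib - weights_gib  # ~10 GiB pre-allocated as KV cache
print(f"~{kv_reserved_gib:.1f} GiB reserved for KV cache before any request arrives")
```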
I would recommend adding …
Jetson Orin NX 16 GB
JetPack 6.2
Docker image: dustynv/vllm:0.6.3-r36.4.0
llm = LLM(model="./Qwen2-7B-Instruct.Q4_K_M.gguf", max_model_len=200, max_num_seqs=1)
This consumes more than 16 GB of memory and the process is killed:
INFO 01-23 07:32:01 model_runner.py:1060] Starting to load model ./Qwen2-7B-Instruct.Q4_K_M.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-23 07:32:42 model_runner.py:1071] Loading model weights took 4.3752 GB
INFO 01-23 07:34:26 gpu_executor.py:122] # GPU blocks: 11836, # CPU blocks: 4681
INFO 01-23 07:34:26 gpu_executor.py:126] Maximum concurrency for 200 tokens per request: 946.88x
Killed
With ollama, I can run qwen2.5-7b on the same device.
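The logged numbers are consistent with that default budget. Assuming Qwen2-7B's published configuration (28 layers, 4 KV heads, head dim 128; assumptions, since the log doesn't state them) and an fp16 KV cache, the 11836 GPU blocks work out to roughly the 90% claim; the block size of 16 tokens can be read off the log itself, since 11836 × 16 / 200 = 946.88, matching the reported concurrency:

```python
# Back-of-envelope check of the logged numbers. Model dimensions are assumed
# from Qwen2-7B's published config; they do not appear in the log itself.
layers, kv_heads, head_dim = 28, 4, 128
bytes_per_elem = 2        # fp16 KV cache
block_tokens = 16         # consistent with 11836 * 16 / 200 = 946.88x in the log
gpu_blocks = 11836        # "# GPU blocks: 11836"

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V: 56 KiB
kv_cache_gib = gpu_blocks * block_tokens * kv_per_token / 2**30   # ~10.1 GiB
weights_gib = 4.38        # "Loading model weights took 4.3752 GB"
print(f"{kv_cache_gib + weights_gib:.1f} GiB")  # ~14.5 GiB ~= 0.9 * 16 GiB
```

That leaves well under 2 GiB of the shared 16 GB for the OS, CUDA context, and activations, which is consistent with the kernel OOM-killing the process; lowering gpu_memory_utilization (or max_model_len) shrinks the pre-allocated KV cache.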