Why does GPU memory increase when setting num-gpu-blocks? #10338
hughesadam87 started this conversation in General
I've set --num-gpu-blocks-override directly so that vLLM preallocates 90% of my GPU memory on startup (model weights plus KV cache). In addition, I've set --gpu-memory-utilization to 0.9. However, under load the system keeps allocating more and more memory and eventually hits an OOM. How is this possible? Shouldn't the preallocated memory be fixed after startup?
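For context, here is a rough back-of-the-envelope sketch of how much GPU memory a given block count reserves for KV cache. All the model dimensions below (block size, layer count, head count, head dim, dtype width) are illustrative assumptions for a generic transformer, not values read from vLLM or from your model config:

```python
# Rough estimate of the KV-cache memory that a fixed number of GPU
# blocks would reserve. Dimensions below are hypothetical defaults,
# not queried from vLLM.
def kv_cache_bytes(num_blocks, block_size=16, num_layers=32,
                   num_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Each token stores one key and one value vector (factor 2)
    # per layer; each block holds `block_size` tokens.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_blocks * block_size * per_token

# Example: 2048 blocks at these assumed sizes -> memory in GiB.
gib = kv_cache_bytes(2048) / 2**30
print(f"{gib:.1f} GiB")  # prints "16.0 GiB" for these assumed dimensions
```

If the blocks reserved this way plus the model weights already account for ~90% of the card, any *additional* allocations under load (activation workspace, CUDA graph capture, sampler buffers, fragmentation in the caching allocator) come on top of that, which is one way the total can still grow past the preallocated amount.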
Finally, is there a way to dynamically free up GPU memory? After all requests are processed, the additional memory is never released by vLLM, even when the system is idle for long periods.