[Bug]: Non-torch memory tracking fails to account for gpu usage of other processes #18854
Labels: bug
🐛 Describe the bug
The test_gpu_utilization.py test is failing in CI, and I can also reproduce it on an 80GB A100 GPU. In https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/llm/test_gpu_utilization.py, 3 LLM instances are started at the same time, each with a GPU memory utilization of 0.3. The first one starts successfully, but for the second one, determine_available_memory() returns a negative number. This is because the memory-accounting code below seems to fail to take into account that non-torch allocated memory could come from other processes. So although the GPU memory is still 70% free when the second instance comes up, it thinks that it is out of memory.
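
For illustration, here is a minimal sketch of how this kind of accounting can go negative. This is not vLLM's actual determine_available_memory() implementation; the function name and formula below are assumptions. The key point is that any memory in use on the device that torch did not allocate gets attributed to the current process, so memory held by other processes inflates the "non-torch" term:

```python
import torch


def sketch_available_memory(gpu_memory_utilization: float = 0.3) -> int:
    """Rough sketch of memory accounting that ignores other processes.

    NOTE: hypothetical illustration only, not vLLM's actual
    determine_available_memory() implementation.
    """
    free_bytes, total_bytes = torch.cuda.mem_get_info()

    # Bytes this process allocated through the torch caching allocator.
    torch_allocated = torch.cuda.memory_stats().get(
        "allocated_bytes.all.current", 0)

    # Everything in use on the device that torch did not allocate is
    # attributed to *this* process as "non-torch" overhead -- but
    # (total - free) also counts memory held by other processes sharing
    # the GPU (e.g. the other LLM instances started by the test).
    non_torch_overhead = (total_bytes - free_bytes) - torch_allocated

    # Memory budget for this instance.
    budget = int(total_bytes * gpu_memory_utilization)

    # With other instances already occupying the GPU, non_torch_overhead
    # is inflated by their usage, so this difference can drop below zero
    # even though a large fraction of the device is actually free.
    return budget - torch_allocated - non_torch_overhead


if torch.cuda.is_available():
    print(f"available bytes (sketch): {sketch_available_memory(0.3)}")
```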