VRAM usage with Vulkan backend #1159
-
I have an AMD Radeon RX 5700 XT with 8 GB of VRAM, running koboldcpp v1.74 on Windows 10. I am using the Vulkan backend and I do not really understand how the VRAM allocation works. For example, I am currently using a Nemo-Mistral-based 12B model in the Q4_K_M quant, which is 7.3 GB and has 43 layers. If I try to load e.g. 38/41 layers on the GPU, plus a 4096 context size, I get the following error:
This obviously indicates that koboldcpp was not able to allocate the layers on the GPU because it could not find enough VRAM. However, the expected buffer size is "5907.03 MiB", which is much lower than the available VRAM. In fact, the whole model is smaller than the available VRAM. So, how is VRAM used in the Vulkan backend? Is the Vulkan backend somehow "wasting" VRAM?
-
The driver or OS may not allow the full amount of a device's memory to be used by a single application. It's possible that this limit is configurable in your GPU control panel.
-
I could not find such a setting, and I do not think either the OS or the driver is limiting this. I think the problem is that it is quite hard to estimate the memory used by the context. Is there a way to ask koboldcpp to print the total amount of VRAM required (model, context, KV cache...) before the allocation request actually happens?
It can only do a rough estimate before it's loaded; set layers to -1 to auto-guess. After it's loaded, you can see the amount used in Task Manager. Unfortunately, some trial and error may be needed.
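For anyone who wants to do the rough estimate by hand, something like the sketch below works. This is not koboldcpp's actual estimator; the architecture numbers (40 repeating layers, 8 KV heads of dimension 128, fp16 cache) are illustrative assumptions, and it ignores compute buffers, which is one reason real usage comes out higher than weights + KV cache alone:

```python
# Back-of-envelope VRAM estimate for partial GPU offload.
# NOTE: all architecture numbers in the example call are hypothetical,
# not exact values for any specific model, and compute/scratch buffers
# are not included.

def kv_cache_bytes(gpu_layers, ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer;
    # bytes_per_elem=2 assumes an fp16 cache.
    return 2 * gpu_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

def vram_estimate_mib(model_bytes, n_layers, gpu_layers, ctx,
                      n_kv_heads, head_dim):
    # Crude assumption: every layer contributes equally to the file size.
    weights = model_bytes / n_layers * gpu_layers
    kv = kv_cache_bytes(gpu_layers, ctx, n_kv_heads, head_dim)
    return (weights + kv) / (1024 ** 2)

# Example: 7.3 GB model, 40 layers, 38 offloaded, 4096 context,
# 8 KV heads of dim 128 (all hypothetical).
print(round(vram_estimate_mib(7.3 * 1024**3, 40, 38, 4096, 8, 128)), "MiB")
```

Comparing that figure against the device-local heap reported by Task Manager (or `vulkaninfo`) gives a first guess at how many layers will fit before trying to load.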