VRAM usage with Vulkan backend #1159
-
I have an AMD Radeon RX 5700 XT with 8 GB of VRAM, running koboldcpp v1.74 on Windows 10. I am using the Vulkan backend and I do not really understand how the VRAM allocation works. For example, I am currently using a Nemo-Mistral-based 12B model in the Q4_K_M quant, which is 7.3 GB and has 43 layers. If I try to load e.g. 38/41 layers on the GPU, plus a 4096 context size, I get the following error:
This obviously indicates that koboldcpp was not able to allocate the layers on the GPU because it could not find enough VRAM. However, the expected buffer size is "5907.03 MiB", which is much lower than the available VRAM. In fact, the whole model is smaller than the available VRAM. So, how is VRAM used in the Vulkan backend? Is the Vulkan backend somehow "wasting" VRAM?
-
The driver or OS may not allow the full amount of a device's memory to be used by a single application. It's possible that this limit is configurable in your GPU control panel.
-
I could not find such a setting, and I do not think either the OS or the driver is limiting this. I think the problem is that it is quite hard to estimate the memory used by the context. Is there a way to ask koboldcpp to print the total amount of VRAM required (model, context, KV cache...) before the allocation request actually happens?
It can only do a rough estimate before it's loaded; set layers to -1 to auto-guess. After it's loaded, you can see the amount used in Task Manager. Unfortunately, some trial and error may be needed.
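For anyone who wants to do the rough estimate by hand, something like the sketch below works. This is not koboldcpp's actual estimator; the architecture numbers (40 repeating layers, 8 KV heads of dimension 128, fp16 cache) are illustrative assumptions, and it ignores compute buffers, which is one reason real usage comes out higher than weights + KV cache alone:

```python
# Back-of-envelope VRAM estimate for partial GPU offload.
# NOTE: all architecture numbers in the example call are hypothetical,
# not exact values for any specific model, and compute/scratch buffers
# are not included.

def kv_cache_bytes(gpu_layers, ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer;
    # bytes_per_elem=2 assumes an fp16 cache.
    return 2 * gpu_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

def vram_estimate_mib(model_bytes, n_layers, gpu_layers, ctx,
                      n_kv_heads, head_dim):
    # Crude assumption: every layer contributes equally to the file size.
    weights = model_bytes / n_layers * gpu_layers
    kv = kv_cache_bytes(gpu_layers, ctx, n_kv_heads, head_dim)
    return (weights + kv) / (1024 ** 2)

# Example: 7.3 GB model, 40 layers, 38 offloaded, 4096 context,
# 8 KV heads of dim 128 (all hypothetical).
print(round(vram_estimate_mib(7.3 * 1024**3, 40, 38, 4096, 8, 128)), "MiB")
```

Comparing that figure against the device-local heap reported by Task Manager (or `vulkaninfo`) gives a first guess at how many layers will fit before trying to load.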