Replies: 1 comment
The default head size is
Usually, the attention head size is `head_dim = hidden_dim // num_attention_heads` in many model architectures, including Llama. Some models use more flexible `head_dim` sizes. For Llama models, here is one pending PR for HF.
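For concreteness, here is a minimal sketch of that default relationship versus an explicitly configured head size. The names and numbers are illustrative only, not taken from any particular model config:

```cpp
#include <cstdint>
#include <cstdio>

// Default convention: derive the per-head size from the embedding width
// and the number of attention heads (integer division).
static uint32_t default_head_dim(uint32_t hidden_dim, uint32_t num_attention_heads) {
    return hidden_dim / num_attention_heads;
}

int main() {
    // Llama-2-7B-style shapes: hidden_dim 4096, 32 heads -> head_dim 128.
    std::printf("derived head_dim = %u\n", default_head_dim(4096, 32));

    // A model with a decoupled head size ships an explicit head_dim in its
    // config instead (hypothetical numbers): hidden_dim 3072 with 12 heads
    // would derive 256, but the config could say 128.
    const uint32_t explicit_head_dim = 128;
    std::printf("explicit head_dim = %u (derived would be %u)\n",
                explicit_head_dim, default_head_dim(3072, 12));
    return 0;
}
```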
Looking at `src/llama.cpp`, I feel like the information is handled around here, but I'm not sure. Could anybody help me understand how this information is loaded into `hparams` and how it can be used in `build_*()`? Thank you!
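I am not certain of the exact loading path, so here is only a minimal sketch of the general pattern (not llama.cpp's actual code): a loader fills an `hparams`-like struct from per-architecture metadata keys, falls back to `n_embd / n_head` when no explicit head size is stored, and the graph-building code then reads the resolved value from `hparams`. The key strings, field names, and helpers below are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for GGUF-style metadata: key -> integer value.
using kv_store = std::map<std::string, uint32_t>;

// Simplified hparams; the real llama.cpp struct has many more fields.
struct hparams_t {
    uint32_t n_embd      = 0; // hidden_dim
    uint32_t n_head      = 0; // num_attention_heads
    uint32_t n_embd_head = 0; // per-head size used when building the graph
};

// Hypothetical helper: read a key, with an optional default for missing keys.
static uint32_t get_key(const kv_store & kv, const std::string & key,
                        bool required, uint32_t def = 0) {
    auto it = kv.find(key);
    if (it != kv.end()) return it->second;
    if (required) throw std::runtime_error("missing key: " + key);
    return def;
}

// Load hparams: use an explicit head-size key if present, otherwise derive it.
// The key strings here are illustrative, not guaranteed to match the real ones.
static hparams_t load_hparams(const kv_store & kv) {
    hparams_t hp;
    hp.n_embd = get_key(kv, "llama.embedding_length",     /*required=*/true);
    hp.n_head = get_key(kv, "llama.attention.head_count", /*required=*/true);
    hp.n_embd_head = get_key(kv, "llama.attention.key_length", /*required=*/false,
                             hp.n_embd / hp.n_head); // default fallback
    return hp;
}

// A build_*()-style consumer only reads the already-resolved value.
static void build_attention(const hparams_t & hp) {
    std::printf("per-head size: %u (heads: %u)\n", hp.n_embd_head, hp.n_head);
}

int main() {
    kv_store kv = {
        {"llama.embedding_length",     4096},
        {"llama.attention.head_count", 32},
        // no explicit head-size key -> falls back to 4096 / 32 = 128
    };
    build_attention(load_hparams(kv));
    return 0;
}
```

This only models the fallback idea; if I read the code correctly, the real `hparams` keeps separate key/value head sizes rather than a single field, but the resolve-once-then-consume pattern is the same.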