
Compilade/bitnet ternary #294

Merged

Conversation

Nexesenex
Owner

No description provided.

compilade and others added 30 commits June 27, 2024 02:06
Not using a lookup table anymore makes it match q4_0 speed.
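For context, a minimal C sketch of what dropping a lookup table can look like: packed low-bit weights are unpacked with shifts and masks instead of table lookups. The layout below (two bits per value, stored with a +1 offset) is an assumption for illustration, not the actual ternary packing used in ggml-quants.

```c
#include <stdint.h>

// Hypothetical packing: 4 values per byte, each stored as {0,1,2}.
// Unpack with shift-and-mask instead of a lookup table.
static inline void unpack_ternary_2bit(const uint8_t *qs, int n_bytes, int8_t *out) {
    for (int i = 0; i < n_bytes; ++i) {
        for (int k = 0; k < 4; ++k) {
            // extract 2 bits, then map the {0,1,2} encoding down to {-1,0,+1}
            out[4*i + k] = (int8_t) (((qs[i] >> (2*k)) & 3) - 1);
        }
    }
}
```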

* gguf-py : fix formatting

* llama : remove spaces on empty line
This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.
This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.
Its FFN size is 5460, which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.
Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.
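A plain-C sketch of the sum-offset trick described above, under the assumption that ternary weights are stored offset by +1 as q ∈ {0,1,2}: since dot(q-1, x) = dot(q, x) - sum(x), the offset can be removed once per block instead of once per element.

```c
#include <stdint.h>

// q[i] in {0,1,2} encodes a weight w[i] = q[i] - 1 in {-1,0,+1}.
// Accumulate dot(q, x) and sum(x), then subtract the sum once at the end
// instead of subtracting 1 from every weight.
static int32_t ternary_dot(const uint8_t *q, const int8_t *x, int n) {
    int32_t acc = 0; // dot(q, x)
    int32_t sx  = 0; // sum(x)
    for (int i = 0; i < n; ++i) {
        acc += (int32_t) q[i] * (int32_t) x[i];
        sx  += (int32_t) x[i];
    }
    return acc - sx; // == dot(q - 1, x)
}
```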
The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.
* llama : remove the separate scale tensors of BitNet b1.58

They won't be needed, since the remaining ternary quant types have
built-in scales.
Not yet tested on hardware which supports it;
it might not work or might not even compile, but it also might.
It should make the performance better on recent ARM CPUs.
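The ARM feature in question is presumably the ARMv8.2 dot-product extension. A hypothetical sketch of how it is typically used (not the actual ggml kernel); n is assumed to be a multiple of 16.

```c
#if defined(__ARM_FEATURE_DOT_PRODUCT)
#include <arm_neon.h>
#include <stdint.h>

// Each vdotq_s32 multiplies 16 pairs of int8 values and accumulates
// the partial sums into 4 int32 lanes.
static int32_t dot_s8_dotprod(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    return vaddvq_s32(acc); // horizontal sum of the 4 lanes
}
#endif
```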

* ggml-quants : remove comment about possible format change of TQ2_0

Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging by the code, which either launches compute kernels or copies tensors.
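A hedged C sketch of the distinction described above (not the actual ggml_vk_sync_buffer implementation): a full pipeline/memory barrier versus a narrower barrier that only orders compute-shader and transfer access.

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

// Heavy-handed: order all stages against all stages, all memory access.
static void full_pipeline_sync(VkCommandBuffer cmd) {
    VkMemoryBarrier mb = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT,
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
        0, 1, &mb, 0, NULL, 0, NULL);
}

// Narrower: only order shader reads/writes and transfers, which the commit
// message says is sufficient for compute dispatches and tensor copies.
static void compute_transfer_sync(VkCommandBuffer cmd) {
    VkMemoryBarrier mb = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT  | VK_ACCESS_TRANSFER_READ_BIT,
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
        0, 1, &mb, 0, NULL, 0, NULL);
}
```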

* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Neo Zhang <>
* gguf-py : Numpy dequantization for most types

* gguf-py : Numpy dequantization for grid-based i-quants
* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

This does not change anything for ternary models,
since their values should never end up in halfway cases anyway.
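A minimal sketch of the rounding step being discussed, assuming quantization maps x/d to the nearest integer level; the helper below is illustrative, not the actual TQ1_0/TQ2_0 code.

```c
#include <math.h>

// Quantize one value to a ternary level {-1, 0, +1} given a non-zero scale d.
// roundf replaces a custom nearest-int helper; exact halfway cases could
// round differently, but ternary weights never produce them.
static int quantize_ternary_one(float x, float d) {
    int q = (int) roundf(x / d);
    if (q < -1) q = -1;
    if (q >  1) q =  1;
    return q;
}
```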
ggerganov and others added 15 commits August 12, 2024 10:21
* py : fix requirements check '==' -> '~='

* cont : fix the fix

* ci : run on all requirements.txt
* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models; it uses llama.cpp as the backend.

Signed-off-by: thxCode <[email protected]>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check a GGUF file and estimate its
memory usage without downloading the whole model.

Signed-off-by: thxCode <[email protected]>

---------

Signed-off-by: thxCode <[email protected]>
* llama : model-based max number of graph nodes calculation

* Update src/llama.cpp

---------

Co-authored-by: slaren <[email protected]>
* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h and moves them back to the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.
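A short sketch of what a define-based check like this typically looks like; the value and helper below are illustrative rather than copied from ggml.h.

```c
// The mode parameter is treated as a bit field; the NeoX RoPE style is
// selected by testing a single bit.
#define GGML_ROPE_TYPE_NEOX 2

static int rope_is_neox(int mode) {
    return (mode & GGML_ROPE_TYPE_NEOX) != 0;
}
```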

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <[email protected]>
The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.

* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.
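A hedged sketch of what such a fallback can look like when picking a tensor's quantization type; the names and the divisibility condition are assumptions for illustration, not the actual llama.cpp logic.

```c
#include <stdint.h>

enum qtype { QTYPE_TQ1_0, QTYPE_TQ2_0, QTYPE_Q4_0 };

// Hypothetical rule: if a row is not a multiple of the ternary block size,
// quantize that tensor with Q4_0 instead of TQ1_0/TQ2_0.
static enum qtype pick_type(enum qtype wanted, int64_t row_size, int64_t ternary_block) {
    if ((wanted == QTYPE_TQ1_0 || wanted == QTYPE_TQ2_0) && row_size % ternary_block != 0) {
        return QTYPE_Q4_0; // not lossless for ternary weights, but close enough
    }
    return wanted;
}
```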
@Nexesenex Nexesenex merged commit 37dfe57 into Nexesenex:lcpp_pr_bitnet_v2 Aug 14, 2024
8 of 11 checks passed