
Kobold.CPP_FrankenFork_v1.68w_b3235+3

Nexesenex released this 27 Jun 21:59
· 163 commits to factor_x since this release

Frankenstein 1.68w "Fork" of KoboldCPP Experimental, up to 27/06/2024, 15h GMT+2.
Based on Llama.CPP b3235, and aimed mainly at users of Turing, Ampere, and Ada GPUs.
Rebased on the KoboldCPP experimental branch from 26/06/2024 to fix several issues.

In its CUDA 12 version, this is probably the fastest series of KCPP-F builds ever released in terms of prompt processing.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for greedy testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. These builds are not "upgrades" over the official version, and they may be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable, and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option). Now integrated, in a slightly revamped form, into the official version.

  • 21 KV cache options (all should be considered experimental except F16 and KV Q8_0):

F16 -> foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with: V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with: V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0 (maybe stable); the rest is untested beyond benches
K Q5_1 with: V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with: V Q5_0, Q4_1, Q4_0
K Q4_1 with: V Q4_1 (stable), Q4_0 (maybe stable)
KV Q4_0 (quite stable, considering that it's part of the LCPP/KCPP main triplet)
Works from the command line, normally also via the GUI, and normally saves to .kcpps config files.

  • A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, plus, for Llama models, an additional negative offset to lower the L1/L2 rope a bit, as well as a positive offset for SOLAR models, in order to improve perplexity (L1, L2, Solar) or avoid degrading the reasoning abilities too much (L3, not implemented yet) at equal context. A generic sketch of this kind of rope-base scaling follows below.
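
    For illustration only, here is a minimal Python sketch of a generic NTK-aware rope-frequency-base adjustment, the family of heuristics an autorope belongs to. The exact formula and the per-family offsets used by the PR may differ; `train_ctx`, `head_dim`, and `offset` below are assumptions.

```python
# Illustrative sketch only: a generic NTK-aware rope base adjustment,
# NOT the exact formula used by the askmyteapot autorope PR.
def ntk_rope_base(target_ctx: int,
                  train_ctx: int = 4096,    # assumed training context (e.g. Llama-2)
                  base: float = 10000.0,    # default rope_freq_base
                  head_dim: int = 128,      # assumed head dimension
                  offset: float = 0.0) -> float:
    """Scale rope_freq_base so a model tolerates a longer context.

    `offset` stands in for the small negative (L1/L2) or positive (SOLAR)
    correction mentioned above; its value is a per-family tuning choice.
    """
    if target_ctx <= train_ctx:
        return base + offset
    scale = target_ctx / train_ctx
    return base * scale ** (head_dim / (head_dim - 2)) + offset

# Example: stretching an assumed 4096-token model to 8192 tokens of context.
print(round(ntk_rope_base(8192)))  # ~20221 with the defaults above
```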

  • Faster prompt processing (PP), in both CuBLAS and MMQ modes, thanks to Johannes Gaessler's work on these kernels, and to compilation with CUDA arch 75 on top of the existing 60, 61, 70 combo. BlasBatchSize 512 is still optimal, but 256 is not far behind and is now used by default; 64 is perfectly usable and optimal for VRAM-limited scenarios.

  • A slight deboost of pipeline parallelization, reduced from 4 to 2: 0.5-1% VRAM saved, and less stress on the graphics cards.

  • Bitnet PR integrated; works in CPU mode (only recent, properly converted Bitnet models work, older ones do not).
    Example : https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/tree/main

ARGUMENTS (to be edited; check them in the CLI or use the GUI)

Note: I had to use a simple 0-20 numbering scheme so that the GUI and the .kcpps preset saving work properly. The problems with the previous 4-digit quant numbering scheme are fixed. (A small decoding sketch for those legacy 4-digit codes follows the listing below.)

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25 BPW),
2 = 1651/Kf16-Vq5_1 (11 BPW),
3 = 1650/Kf16-Vq5_0 (10.75 BPW),
4 = 1641/Kf16-Vq4_1 (10.5 BPW),
5 = 1640/Kf16-Vq4_0 (10.25 BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25 BPW),
8 = 8050/Kq8_0-Vq5_0 (7 BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75 BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5 BPW),

11 = 5151/KVq5_1 (6 BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75 BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5 BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25 BPW),

15 = 5050/Kq5_0-Vq5_0 (5.5 BPW),
16 = 5041/Kq5_0-Vq4_1 (5.25 BPW),
17 = 5040/Kq5_0-Vq4_0 (5 BPW),

18 = 4141/Kq4_1-Vq4_1 (5 BPW),
19 = 4140/Kq4_1-Vq4_0 (4.75 BPW),
20 = 4040/KVq4_0 (4.5 BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], default=0)
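
For readability, here is a minimal, illustrative Python sketch (not code from KoboldCPP itself) that decodes one of the legacy 4-digit codes above, such as 8041, into its K and V cache types, and recomputes the listed average bits per weight from the standard GGML cell sizes (f16 = 16, q8_0 = 8.5, q5_1 = 6, q5_0 = 5.5, q4_1 = 5, q4_0 = 4.5 BPW).

```python
# Illustrative sketch, not KoboldCPP code: decode the legacy 4-digit KV codes.
# The first two digits describe the K cache, the last two the V cache.
GGML_BPW = {
    "16": ("f16", 16.0),
    "80": ("q8_0", 8.5),
    "51": ("q5_1", 6.0),
    "50": ("q5_0", 5.5),
    "41": ("q4_1", 5.0),
    "40": ("q4_0", 4.5),
}

def decode_quantkv(code: str):
    """Return (K type, V type, average BPW) for a code like '8041'."""
    k_name, k_bpw = GGML_BPW[code[:2]]
    v_name, v_bpw = GGML_BPW[code[2:]]
    return k_name, v_name, (k_bpw + v_bpw) / 2

print(decode_quantkv("8041"))  # ('q8_0', 'q4_1', 6.75) -> option 9 above
print(decode_quantkv("4040"))  # ('q4_0', 'q4_0', 4.5)  -> option 20 above
```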

The Lowvram option's speed is (logically) boosted due to the smaller KV cache kept in RAM: from 25%+ with KV Q8_0 to 50%+ with KV Q4_0.
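
As a rough, assumption-laden illustration of why a quantized cache shrinks the memory footprint, the sketch below estimates the KV cache size of a hypothetical GQA model (40 layers, 8 KV heads, head dimension 128; adjust to the actual dimensions of your GGUF) at a given context length.

```python
# Rough illustration with assumed model dimensions: estimate the KV cache size.
BPW = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0, "q5_0": 5.5, "q4_1": 5.0, "q4_0": 4.5}

def kv_cache_gib(ctx: int, k: str, v: str,
                 n_layers: int = 40, n_kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV cache size in GiB for a hypothetical GQA model; tune dims to your model."""
    elems_per_token = n_layers * n_kv_heads * head_dim        # per K and per V
    bytes_per_token = elems_per_token * (BPW[k] + BPW[v]) / 8
    return ctx * bytes_per_token / 1024**3

for pair in [("f16", "f16"), ("q8_0", "q8_0"), ("q4_0", "q4_0")]:
    print(pair, round(kv_cache_gib(8192, *pair), 2), "GiB")
# With these assumed dimensions at 8192 context: ~1.25, ~0.66 and ~0.35 GiB.
```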

REMARKS :
You MUST use Flash Attention for anything other than QKV=0 (F16)
(flag: --flashattention on the command line, or enable it in the GUI).
ContextShift doesn't work with anything other than KV F16, but SmartContext does.
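
Putting these remarks together, here is a minimal launch sketch written as a Python command builder; the model path is a placeholder and the exact flag spellings should be checked against `python koboldcpp.py --help` for your build.

```python
# Minimal launch sketch: a quantized KV cache (option 6 = KV q8_0) requires
# flash attention; pair it with smartcontext, since contextshift won't work.
# The model path is a placeholder; verify flag names with `--help` for your build.
import subprocess

cmd = [
    "python", "koboldcpp.py",
    "--model", "/path/to/model.gguf",  # placeholder path
    "--usecublas",                     # CUDA/CuBLAS backend
    "--flashattention",                # mandatory for any --quantkv other than 0
    "--quantkv", "6",                  # 6 = KV q8_0 (8.5 BPW)
    "--smartcontext",                  # contextshift is unavailable with quantized KV
]
subprocess.run(cmd, check=True)
```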

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. A special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. CuBLAS 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), CuBLAS 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard build, include OpenBLAS, CLBlast, and Vulkan support as provided by the devs.

Full Changelog: v1.68t_b3230-3+2...v1.68w_b3235+3