b2998 #136

Nexesenex · 2024-05-25T13:32:10Z

No description provided.

* common : increase max number of experts to 128
* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn
* gguf-py : add architecture-specific block mappings that override selected general block mappings
* convert-hf : add model conversion support for ArcticForCausalLM
* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)
* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk


Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02)
  → 'github:hercules-ci/flake-parts/8dc45382d5206bd292f9c2768b8058a8fd8311d9?narHash=sha256-/GJvTdTpuDjNn84j82cU6bXztE0MSkdnTWClUCRub78%3D' (2024-05-16)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02)
  → 'github:NixOS/nixpkgs/4a6b83b05df1a8bd7d99095ec4b4d271f2956b64?narHash=sha256-%2BNpbZRCRisUHKQJZF3CT%2Bxn14ZZQO%2BKjxIIanH3Pvn4%3D' (2024-05-17)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>


* Make tokenizer.cpp CLI tool nicer.

  Before this commit, tokenize was a simple CLI tool like this:

      tokenize MODEL_FILENAME PROMPT [--ids]

  This simple tool loads the model, takes the prompt, and shows the tokens llama.cpp is interpreting.

  This changeset makes tokenize more sophisticated and more useful for debugging and troubleshooting:

      tokenize [-m, --model MODEL_FILENAME] [--ids] [--stdin] [--prompt] [-f, --file] [--no-bos] [--log-disable]

  It also behaves nicer on Windows now, interpreting and rendering Unicode from command-line arguments and pipes no matter what code page the user has set on their terminal.
* style fix: strlen(str) == 0 --> *str == 0
* Simplify tokenize.cpp by getting rid of handling positional style arguments. It must now be invoked with long --model, --prompt etc. arguments only. Shortens the code.
* tokenize.cpp: iostream header no longer required

---------

Co-authored-by: Georgi Gerganov
Co-authored-by: brian khuu




* main : don't print special tokens with --grammar

  The CLI interface was recently changed to print special control tokens like the </s> stop message one. This token shouldn't be printed if the grammar flag was passed, unless the grammar specifies it, because that breaks shell-scriptability.
* main: use separate stream for control characters
* main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out
* main: dprintf isn't part of the IEEE POSIX standard. Just use write().
* main: remove --ctrl-token-fd-out in favor of fcntl() based detection
* common.cpp: accidentally removed --interactive-first
* main: only merge stdout and control token if not in conversation or grammar mode
* main: rejig control token descriptor handling
* main: must check pipe status at the very top of the program
* main: renamed --no-special from --ctrl-token-no-out and other refactoring
* main: refactor ctrl_token_no_out --> no_special
* llama: rename llama_token_is_control_token() to llama_token_is_control()
* main: remove special token file descriptor feature (#5)

---------

Co-authored-by: Brian


* q2_k_r4: Zen4

  PP-512(LLaMA-3.1-8B) = 256 t/s
* q3_k_r4: AVX2
* q2_k_r4: AVX2

  We get PP-512(LLaMA-3.1-8B) = 287 t/s. Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow forgot to push upstream.
* q2_k_r4: NEON

  We get PP-512(LLaMA-3.1-8B) = 106.2 t/s. TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S.
* Make sure rows per thread are a multiple of 4

---------

Co-authored-by: Iwan Kawrakow
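The last bullet ("rows per thread are a multiple of 4") can be illustrated with a small sketch. It assumes that the `_r4` suffix means rows are packed in groups of 4, so that no group may be split between threads. The function name `split_rows` and its parameters are hypothetical and are not taken from the commit; the real kernels are C++ and may partition work differently.

```python
def split_rows(n_rows, n_threads, group=4):
    """Assign each thread a contiguous [start, end) row range.

    Hypothetical helper: every range length is a multiple of `group`,
    so a packed group of rows is never shared between two threads.
    """
    if n_rows % group != 0:
        raise ValueError("n_rows must be a multiple of the group size")
    n_groups = n_rows // group
    base, extra = divmod(n_groups, n_threads)
    ranges = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional group each.
        groups = base + (1 if t < extra else 0)
        end = start + groups * group
        ranges.append((start, end))
        start = end
    return ranges
```

With 20 rows and 3 threads this yields 8, 8, and 4 rows per thread instead of the uneven 7/7/6 split that plain division would give.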

fairydreaming and others added 11 commits May 24, 2024 14:31

docker.yml: disable light-intel and server-intel test (#7515)

27891f6

* docker.yml: disable light-intel test
* docker.yml: disable server-intel test

gguf-py : fix and simplify quantized shape round-trip (#7483)

b83bab1

* gguf-py : fix and simplify quantized shape round-trip
* gguf-py : remove unused import
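As a rough illustration of what a quantized shape round-trip involves: a quantized tensor stores its last dimension in fixed-size blocks, so its shape in elements and its shape in bytes differ only in that dimension. The sketch below is not the gguf-py code; `BLOCK_LAYOUT`, `to_byte_shape`, and `from_byte_shape` are hypothetical names, and only the Q4_0 and Q8_0 block layouts (32 elements in 18 and 34 bytes respectively) are filled in.

```python
# (elements per block, bytes per block); hypothetical table for illustration.
BLOCK_LAYOUT = {
    "Q4_0": (32, 18),  # 2-byte scale + 32 4-bit values
    "Q8_0": (32, 34),  # 2-byte scale + 32 8-bit values
}


def to_byte_shape(shape, qtype):
    """Convert a shape in elements to the shape of the raw byte array."""
    block_size, type_size = BLOCK_LAYOUT[qtype]
    if shape[-1] % block_size != 0:
        raise ValueError("last dimension is not a multiple of the block size")
    return (*shape[:-1], shape[-1] // block_size * type_size)


def from_byte_shape(shape, qtype):
    """Inverse of to_byte_shape: recover the shape in elements."""
    block_size, type_size = BLOCK_LAYOUT[qtype]
    if shape[-1] % type_size != 0:
        raise ValueError("last dimension is not a multiple of the type size")
    return (*shape[:-1], shape[-1] // type_size * block_size)
```

A round-trip is correct when `from_byte_shape(to_byte_shape(s, t), t) == s` for every valid shape `s`; only the last dimension is ever rescaled.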

fix missing slash in fs_get_cache_directory() (#7503)

902184d

* fix missing slash in fs_get_cache_directory()
* use LOCALAPPDATA for fs_get_cache_directory()
* better code style
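The two fixes above (a missing path separator and the use of LOCALAPPDATA on Windows) can be sketched as follows. This is a Python illustration of the idea only, not the C++ function from the commit: the helper names, the `llama.cpp` subdirectory, and the XDG_CACHE_HOME / HOME fallbacks on non-Windows systems are assumptions.

```python
def with_trailing_sep(path, sep):
    """Append the separator only if the path does not already end with it."""
    return path if path.endswith(sep) else path + sep


def cache_directory(env, is_windows, app="llama.cpp"):
    """Build a per-user cache directory that always ends with a separator.

    `env` is a mapping of environment variables (for example os.environ).
    """
    sep = "\\" if is_windows else "/"
    if is_windows:
        base = env["LOCALAPPDATA"]
    else:
        # Assumed fallback order for this sketch.
        base = env.get("XDG_CACHE_HOME") or env["HOME"] + "/.cache"
    # Normalizing the base first is what prevents the missing-slash bug.
    base = with_trailing_sep(base, sep)
    return with_trailing_sep(base + app, sep)
```

Because the base path is normalized before the subdirectory is appended, the result is the same whether or not the environment variable ends with a separator.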

android : module (#7502)

9791f40

* move ndk code to a new library
* add gradle file

ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433)

faa0e69

* Add SVE support for q4_0_q8_0 q8_0_q8_0
* remove ifdef

labeler: added Apple Metal detector (+Kompute) (#7529)

3cbd23e

* labeler: added Apple Metal detector [no ci]
* labeler: add Kompute to detector [no ci]

train : change default FA argument (#7528)

9588f19

github-actions bot added examples python ggml devops build android labels May 25, 2024

Nexesenex merged commit a4def5e into Nexesenex:Nexes_quants May 25, 2024
28 of 68 checks passed
