Sgemm ppc mma #2

Closed
wants to merge 183 commits from sgemm_ppc_mma

Conversation

amritahs-ibm (Owner)

ggerganov and others added 30 commits September 23, 2024 11:27
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5%2B9byVi9BZGWTD4VSbWe8VLpp9B/iYk%3D' (2024-09-11)
  → 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…gerganov#9598)

Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write
in the normal case and just use a thread fence as originally intended.

---
Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on an 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <[email protected]>
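
As an illustration of the fix described above, here is a minimal sketch, assuming 64-byte cache lines; the struct and function names are illustrative only, not the actual ggml threadpool code.

```c
// Minimal sketch of the cache-line split, assuming 64-byte cache lines.
// Names are illustrative; this is not the actual ggml threadpool struct.
#include <stdalign.h>
#include <stdatomic.h>

struct barrier_counters {
    // each counter sits on its own cache line, so readers spinning on
    // n_barrier_passed do not bounce the line being written by n_barrier writers
    alignas(64) atomic_int n_barrier;
    alignas(64) atomic_int n_barrier_passed;
};

// Only the last thread to arrive writes n_barrier_passed, which removes the
// serialized updates on the exit path mentioned above.
static void barrier_wait(struct barrier_counters * b, int n_threads) {
    const int passed_old = atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed);
    if (atomic_fetch_add(&b->n_barrier, 1) == n_threads - 1) {
        // last thread: reset the arrival counter and release everyone
        atomic_store(&b->n_barrier, 0);
        atomic_fetch_add(&b->n_barrier_passed, 1);
    } else {
        while (atomic_load(&b->n_barrier_passed) == passed_old) {
            // spin; a real implementation would pause or yield here
        }
    }
}
```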
* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <[email protected]>

* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <[email protected]>
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
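
A minimal sketch of such a fence follows; the function name and the MSVC fallback chosen here (an x86/x64 mfence intrinsic) are assumptions for illustration, not necessarily the exact workaround used in ggml.

```c
// Hedged sketch: a full memory fence that works with and without <stdatomic.h>.
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <intrin.h>
    static inline void full_fence(void) {
        _mm_mfence();   // hardware fence; MSVC's C mode may lack <stdatomic.h>
    }
#else
    #include <stdatomic.h>
    static inline void full_fence(void) {
        atomic_thread_fence(memory_order_seq_cst);
    }
#endif
```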
…9605)

* sampling : avoid expensive softmax during greedy sampling (a sketch of the idea follows this commit message)

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <[email protected]>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <[email protected]>
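
Regarding the greedy-sampling change above: when only the single most likely token is needed, an argmax over the raw logits suffices and the softmax normalization can be skipped entirely. A minimal sketch, not the actual llama.cpp sampler API:

```c
// Greedy sampling is just an argmax over the logits, so no softmax
// (exp + normalization) is required. Illustrative code only.
#include <stddef.h>

static size_t greedy_pick(const float * logits, size_t n_vocab) {
    size_t best = 0;
    for (size_t i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best; // softmax would not change which index wins
}
```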
* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route (see the layout sketch after this commit message).

Branch: GraniteMoE

Co-Authored-By: [email protected]

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* Typo fix in docstring

Co-Authored-By: [email protected]

Co-authored-by: Georgi Gerganov <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
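
As a layout illustration of the split described in the granitemoe conversion commit above: input_linear is treated as two stacked [ff, hidden] blocks, gate (w1) followed by up (w3). The C sketch below only shows that addressing; the real split happens in the Python conversion script, and the ordering of the two halves is an assumption here.

```c
// Illustration only, with assumed shapes: input_linear is taken to be w1
// (gate) stacked on top of w3 (up), each a row-major [ff][hidden] block.
// The actual split is performed in the Python conversion script, not in C.
#include <stddef.h>

typedef struct {
    const float * gate_exps; // w1 half
    const float * up_exps;   // w3 half
} ffn_split;

static ffn_split split_input_linear(const float * input_linear, size_t ff, size_t hidden) {
    ffn_split s;
    s.gate_exps = input_linear;               // rows [0, ff)
    s.up_exps   = input_linear + ff * hidden; // rows [ff, 2*ff)
    return s;
}
```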
* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
…9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added a fallback mechanism for when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
* ci : fix docker build number and tag name

* fine-grained permissions
* update oneapi to 2024.2

* use 2024.1

---------

Co-authored-by: arthw <[email protected]>
* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instruction set features
neon, i8mm and sve for Linux and Apple build targets (see the
sketch after this commit message).

* ggml: Extend feature detection to include non-AArch64 Arm architectures

* ggml: Move definition of ggml_arm_arch_features to the global data section
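
A sketch of how such run-time detection is commonly done on the two targets mentioned; the HWCAP bits, sysctl key, and function names are assumptions based on standard OS interfaces, not taken from this commit.

```c
// Hedged sketch of run-time Arm feature detection for Linux and Apple targets.
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
    #include <sys/auxv.h>
    #include <asm/hwcap.h>

    static bool cpu_has_sve(void) {
    #ifdef HWCAP_SVE
        return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
    #else
        return false; // header too old to know about SVE
    #endif
    }
    static bool cpu_has_i8mm(void) {
    #ifdef HWCAP2_I8MM
        return (getauxval(AT_HWCAP2) & HWCAP2_I8MM) != 0;
    #else
        return false; // header too old to know about i8mm
    #endif
    }
#elif defined(__APPLE__)
    #include <sys/sysctl.h>

    static bool sysctl_flag(const char * name) {
        int v = 0;
        size_t sz = sizeof(v);
        return sysctlbyname(name, &v, &sz, NULL, 0) == 0 && v != 0;
    }
    static bool cpu_has_sve(void)  { return false; } // SVE is not exposed on Apple silicon
    static bool cpu_has_i8mm(void) { return sysctl_flag("hw.optional.arm.FEAT_I8MM"); }
#endif
```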
* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <[email protected]>

* fix lint

* trigger ci

---------

Co-authored-by: compilade <[email protected]>
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fields to avoid an unused-field build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
wwoodsTM and others added 11 commits October 25, 2024 19:07
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <[email protected]>
Co-authored-by: pi6am <[email protected]>
* metal : support permuted matrix multiplications

ggml-ci

* cont : use nb01 directly for row steps

ggml-ci

* cont : add comments [no ci]

* metal : minor refactor

* metal : minor
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
  → 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm (Owner, Author)

Wrong pull request created, so closing this.

amritahs-ibm deleted the sgemm_ppc_mma branch on October 28, 2024 at 17:00