Sgemm ppc mma #2

Closed
wants to merge 183 commits from sgemm_ppc_mma

Conversation

amritahs-ibm (Owner)

ggerganov and others added 30 commits September 23, 2024 11:27
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5%2B9byVi9BZGWTD4VSbWe8VLpp9B/iYk%3D' (2024-09-11)
  → 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…gerganov#9598)

Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write
in the normal case and just use a thread fence as originally intended.

---
Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on an 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <[email protected]>
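
As an illustration of the fix described above, here is a minimal sketch, assuming 64-byte cache lines; the struct and function names are illustrative only, not the actual ggml threadpool code.

```c
// Minimal sketch of the cache-line split, assuming 64-byte cache lines.
// Names are illustrative; this is not the actual ggml threadpool struct.
#include <stdalign.h>
#include <stdatomic.h>

struct barrier_counters {
    // each counter sits on its own cache line, so readers spinning on
    // n_barrier_passed do not bounce the line being written by n_barrier writers
    alignas(64) atomic_int n_barrier;
    alignas(64) atomic_int n_barrier_passed;
};

// Only the last thread to arrive writes n_barrier_passed, which removes the
// serialized updates on the exit path mentioned above.
static void barrier_wait(struct barrier_counters * b, int n_threads) {
    const int passed_old = atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed);
    if (atomic_fetch_add(&b->n_barrier, 1) == n_threads - 1) {
        // last thread: reset the arrival counter and release everyone
        atomic_store(&b->n_barrier, 0);
        atomic_fetch_add(&b->n_barrier_passed, 1);
    } else {
        while (atomic_load(&b->n_barrier_passed) == passed_old) {
            // spin; a real implementation would pause or yield here
        }
    }
}
```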
* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <[email protected]>

* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <[email protected]>
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
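
A minimal sketch of such a fence follows; the function name and the MSVC fallback chosen here (an x86/x64 mfence intrinsic) are assumptions for illustration, not necessarily the exact workaround used in ggml.

```c
// Hedged sketch: a full memory fence that works with and without <stdatomic.h>.
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <intrin.h>
    static inline void full_fence(void) {
        _mm_mfence();   // hardware fence; MSVC's C mode may lack <stdatomic.h>
    }
#else
    #include <stdatomic.h>
    static inline void full_fence(void) {
        atomic_thread_fence(memory_order_seq_cst);
    }
#endif
```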
…9605)

* sampling : avoid expensive softmax during greedy sampling (a sketch of the idea follows this commit message)

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <[email protected]>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <[email protected]>
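
Regarding the greedy-sampling change above: when only the single most likely token is needed, an argmax over the raw logits suffices and the softmax normalization can be skipped entirely. A minimal sketch, not the actual llama.cpp sampler API:

```c
// Greedy sampling is just an argmax over the logits, so no softmax
// (exp + normalization) is required. Illustrative code only.
#include <stddef.h>

static size_t greedy_pick(const float * logits, size_t n_vocab) {
    size_t best = 0;
    for (size_t i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best; // softmax would not change which index wins
}
```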
* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route (see the layout sketch after this commit message).

Branch: GraniteMoE

Co-Authored-By: [email protected]

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

* Typo fix in docstring

Co-Authored-By: [email protected]

Co-authored-by: Georgi Gerganov <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
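
As a layout illustration of the split described in the granitemoe conversion commit above: input_linear is treated as two stacked [ff, hidden] blocks, gate (w1) followed by up (w3). The C sketch below only shows that addressing; the real split happens in the Python conversion script, and the ordering of the two halves is an assumption here.

```c
// Illustration only, with assumed shapes: input_linear is taken to be w1
// (gate) stacked on top of w3 (up), each a row-major [ff][hidden] block.
// The actual split is performed in the Python conversion script, not in C.
#include <stddef.h>

typedef struct {
    const float * gate_exps; // w1 half
    const float * up_exps;   // w3 half
} ffn_split;

static ffn_split split_input_linear(const float * input_linear, size_t ff, size_t hidden) {
    ffn_split s;
    s.gate_exps = input_linear;               // rows [0, ff)
    s.up_exps   = input_linear + ff * hidden; // rows [ff, 2*ff)
    return s;
}
```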
* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
…9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added a fallback mechanism for when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
* ci : fix docker build number and tag name

* fine-grained permissions
* update oneapi to 2024.2

* use 2024.1

---------

Co-authored-by: arthw <[email protected]>
* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instruction set features
neon, i8mm and sve for Linux and Apple build targets (see the
sketch after this commit message).

* ggml: Extend feature detection to include non-AArch64 Arm architectures

* ggml: Move definition of ggml_arm_arch_features to the global data section
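
A sketch of how such run-time detection is commonly done on the two targets mentioned; the HWCAP bits, sysctl key, and function names are assumptions based on standard OS interfaces, not taken from this commit.

```c
// Hedged sketch of run-time Arm feature detection for Linux and Apple targets.
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
    #include <sys/auxv.h>
    #include <asm/hwcap.h>

    static bool cpu_has_sve(void) {
    #ifdef HWCAP_SVE
        return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
    #else
        return false; // header too old to know about SVE
    #endif
    }
    static bool cpu_has_i8mm(void) {
    #ifdef HWCAP2_I8MM
        return (getauxval(AT_HWCAP2) & HWCAP2_I8MM) != 0;
    #else
        return false; // header too old to know about i8mm
    #endif
    }
#elif defined(__APPLE__)
    #include <sys/sysctl.h>

    static bool sysctl_flag(const char * name) {
        int v = 0;
        size_t sz = sizeof(v);
        return sysctlbyname(name, &v, &sz, NULL, 0) == 0 && v != 0;
    }
    static bool cpu_has_sve(void)  { return false; } // SVE is not exposed on Apple silicon
    static bool cpu_has_i8mm(void) { return sysctl_flag("hw.optional.arm.FEAT_I8MM"); }
#endif
```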
* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <[email protected]>

* fix lint

* trigger ci

---------

Co-authored-by: compilade <[email protected]>
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fields to avoid an unused-field build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
wwoodsTM and others added 11 commits October 25, 2024 19:07
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <[email protected]>
Co-authored-by: pi6am <[email protected]>
* metal : support permuted matrix multiplications

ggml-ci

* cont : use nb01 directly for row steps

ggml-ci

* cont : add comments [no ci]

* metal : minor refactor

* metal : minor
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
  → 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm (Owner, Author)

Wrong pull request created, so closing this.

amritahs-ibm deleted the sgemm_ppc_mma branch on October 28, 2024 at 17:00