
draft : Caching device_info in device_ext #3

Open
wants to merge 120 commits into base: mixed_types_gemm
Conversation

OuadiElfarouki
Owner

No description provided.

foldl and others added 10 commits July 3, 2024 14:40
* ppl : fix n_seq_max for perplexity

* use 1 seq for kl_divergence
* llama : suppress unref var in Windows MSVC

This commit suppresses two warnings that are currently generated for
src/llama.cpp when building with MSVC on Windows:

```console
C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
```
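
A minimal sketch of the pattern that avoids C4101 (the helper below is hypothetical, not the actual llama.cpp change): leaving the catch parameter unnamed gives MSVC nothing to flag as an unreferenced local variable.

```cpp
#include <exception>
#include <string>

// Hypothetical helper, not the real llama.cpp code.
static int parse_or_default(const std::string & s, int fallback) {
    try {
        return std::stoi(s);
    } catch (const std::exception &) { // unnamed catch parameter: nothing for C4101 to flag
        return fallback;
    }
}
```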

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS`
to the root cmake subproject.

The motivation for this is that currently the following warnings are
displayed when compiling the tests and common cmake subprojects:
```console
test-llama-grammar.cpp
C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror':
This function or variable may be unsafe. Consider using strerror_s
instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
[C:\llama.cpp\build\tests\test-llama-grammar.vcxproj]
...
```

This compile definition is currently set for the `src` subproject, and this
change moves it into the root cmake project so that it is applied to all
cmake subprojects.
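
The cmake change amounts to defining the macro for every translation unit; a minimal standalone sketch of its effect (not the actual build change) is:

```cpp
#define _CRT_SECURE_NO_WARNINGS // must be defined before any CRT header is included
#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
    errno = ENOENT;
    // with the definition above, MSVC no longer emits C4996 for strerror()
    std::printf("%s\n", std::strerror(errno));
    return 0;
}
```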
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
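
A hedged sketch of the encoder-decoder flow enabled by the new API functions named above (llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()); the exact signatures are assumed from the commit message, and batch construction, sampling, and error handling are elided:

```cpp
#include "llama.h"

// Sketch only, not the actual llama-cli/llama-batched code.
static void run_encoder_decoder(llama_model * model, llama_context * ctx, llama_batch prompt_batch) {
    if (llama_model_has_encoder(model)) {
        // 1. run the encoder over the tokenized prompt
        llama_encode(ctx, prompt_batch);

        // 2. start generation from the model's decoder start token
        llama_token dec_start = llama_model_decoder_start_token(model);
        (void) dec_start; // ... feed it to llama_decode() and sample as usual ...
    } else {
        // decoder-only models keep using the existing llama_decode() path
    }
}
```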
ggerganov and others added 28 commits July 11, 2024 11:20
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <[email protected]>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <[email protected]>

---------

Signed-off-by: Daniel Bevenius <[email protected]>
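
For illustration, a small stub (not the actual common.cuh implementation) showing why the warning appears and how a non-returning construct satisfies it: a function declared noreturn must not fall off the end, so it has to end in something like an infinite loop, a trap, or an abort.

```cpp
#include <cstdio>
#include <cstdlib>

// Illustrative stub only.
[[noreturn]] static void no_device_code_stub(const char * file, int line) {
    std::fprintf(stderr, "%s:%d: no device code available\n", file, line);
    std::abort(); // never returns, so the [[noreturn]] contract is satisfied
}
```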
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <[email protected]>
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <[email protected]>

---------

Signed-off-by: Chen Xi <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>
Co-authored-by: Chen Xi <[email protected]>
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
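
A minimal sketch of the PRId64 pattern mentioned above: the format macros from <cinttypes> print int64_t correctly on platforms where the right printf length modifier differs (%ld vs %lld).

```cpp
#include <cinttypes>
#include <cstdio>

int main() {
    const int64_t n_elements = 123456789012345LL;
    std::printf("n_elements = %" PRId64 "\n", n_elements);
    return 0;
}
```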
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
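
A hedged illustration of the snprintf + sizeof() pattern from this commit (not the actual examples/ code): snprintf bounds the write to the destination buffer, and sizeof(buf) keeps that bound in sync if the buffer size is ever changed.

```cpp
#include <cstdio>

int main() {
    char buf[64];
    // sizeof(buf) stays correct even if the buffer size changes later
    std::snprintf(buf, sizeof(buf), "layer %d of %d", 3, 32);
    std::puts(buf);
    return 0;
}
```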
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <[email protected]>
…v#8441)

Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
…anov#8420)

* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this is that the following warning is currently
generated on Windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
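
A sketch of the C4146 fix described above (the helper is hypothetical, not the real _try_copy lambda): negating an unsigned value triggers the warning, so the value is cast to a signed type first and then negated.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only.
static int32_t negated_offset(uint32_t offset) {
    // return (int32_t) -offset;  // C4146: unary minus applied to unsigned type
    return -(int32_t) offset;     // cast first, then negate: no warning
}
```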
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
)

* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma:
it prefers to use the 16-space token over longer space tokens,
while the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from ggerganov#8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with ggerganov#8379

* test-tokenizer-random : add a failing edge case for falcon
* gguf_hash.py: Add sha256

* gguf_hash.py: rename string UUIDv5 --> uuid

* Apply suggestions from code review

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: compilade <[email protected]>
* 9B - query_pre_attn_scalar = 256 not 224

See google/gemma_pytorch@03e6575

Gemma-2 9B should use 256, not 224 (which is what self.config.hidden_size // self.config.num_attention_heads gives)

* llama : fix Gemma-2 Query scaling factor

ggml-ci

---------

Co-authored-by: Daniel Han <[email protected]>
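
A quick numeric check of the 256 vs 224 point above (config values are assumed from the public Gemma-2 9B configuration, not taken from this PR): hidden_size divided by the head count gives 224, while the query scaling should be based on head_dim = 256.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int hidden_size         = 3584; // assumed Gemma-2 9B value
    const int num_attention_heads = 16;   // assumed Gemma-2 9B value
    const int head_dim            = 256;  // the correct query_pre_attn_scalar

    std::printf("hidden_size / heads = %d\n", hidden_size / num_attention_heads); // 224
    std::printf("query scale         = %f\n", 1.0 / std::sqrt((double) head_dim)); // 0.0625
    return 0;
}
```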
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)
  → 'github:NixOS/nixpkgs/7e7c39ea35c5cdd002cd4588b03a3fb9ece6fad9?narHash=sha256-EYekUHJE2gxeo2pM/zM9Wlqw1Uw2XTJXOSAO79ksc4Y%3D' (2024-07-12)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…anov#8474)

* pydantic : replace uses of __annotations__ with get_type_hints

* pydantic : fix Python 3.9 and 3.10 support
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
Fixes a few links within the repo that were broken by the reorganization of the
documentation in ggerganov#8325.
…rganov#8472)

The README.md had stale information. In particular, the --ctx-size
"defaults to 512" claim confused me and I had to check the code to confirm
it was false. Since the server is evolving rapidly, it's probably
better to keep the source of truth in a single place (the source) and
generate the README.md from that.

Did:

    make llama-server
    ./llama-server --help > t.txt
    vimdiff t.txt examples/server/README.md

I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow-up could be to
automate this process with a script.

No functional change.
This commit adds a macro guard around the GCC pragma to avoid the following
warning on Windows:

```console
C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068:
unknown pragma 'GCC' [C:\llama.cpp\build\ggml\src\ggml.vcxproj]
```
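
A hedged sketch of the kind of guard described above (not the exact ggml-aarch64.c lines): wrapping GCC-specific pragmas in a compiler check keeps MSVC from emitting C4068.

```cpp
#if defined(__GNUC__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpedantic"
#endif

/* ... code that needed the GCC-specific pragma ... */

#if defined(__GNUC__)
#pragma GCC diagnostic pop
#endif
```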
github-actions bot added the documentation and nix labels on Jul 16, 2024