Add Smaug 70B support to conversion #7402
Conversation
Do I need to be at all concerned about the tests that got cancelled/skipped?
I noticed that the pre-tokenizer config of this model is slightly different from DBRX. The regex is the same, but the other config params differ. We should make sure that they are compatible by running more extensive tokenization tests.
Any tips on what I can run to accomplish this, or where I can look for more info?
I usually create a vocab and then run tokenization on wikitext:

```sh
python3 convert-hf-to-gguf.py models/tokenizers/dbrx/ --outfile models/ggml-vocab-dbrx.gguf --vocab-only
./tests/test-tokenizer-0.sh dbrx ./build/wikitext-2-raw/wiki.train.raw
```

If all is good, the test should report that the tokenizations match. Though this test is not exhaustive, and even if it passes it does not guarantee correctness. @jaime-m-p has recently been making significant improvements to the tokenization tests, but I haven't studied them in detail yet: #7375
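For context, here is a rough sketch of what the Python reference side of that comparison does. This is an illustrative approximation, not the actual tests/test-tokenizer-0.py; the `.tok` output naming and the paths are assumptions.

```python
# Illustrative sketch only: approximates the reference-tokenization step that the
# llama.cpp output is diffed against. The ".tok" suffix is an assumed convention.
from transformers import AutoTokenizer

def write_reference_tokens(model_dir: str, text_path: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    with open(text_path, encoding="utf-8") as f:
        text = f.read()
    ids = tokenizer.encode(text, add_special_tokens=False)
    # One token id per line, so the result can be diffed against llama.cpp's tokenization.
    with open(text_path + ".tok", "w", encoding="utf-8") as f:
        for tok in ids:
            f.write(f"{tok}\n")

if __name__ == "__main__":
    # Hypothetical paths, mirroring the commands above.
    write_reference_tokens("models/tokenizers/dbrx/", "./build/wikitext-2-raw/wiki.train.raw")
```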
Okay, I'll give that a look! The only reason I clumped it in with DBRX was that it seemed to only need the same regex support as DBRX; hopefully the tests will help show whether that's true.
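For readers unfamiliar with how a new BPE pre-tokenizer gets picked up during conversion: convert-hf-to-gguf.py identifies the pre-tokenizer by hashing the tokenizer's output on a fixed reference string and mapping that hash to a name such as "dbrx" or "smaug-bpe". The sketch below is a simplified illustration of that mapping; the checksum literals are placeholders, not the real values from the script.

```python
# Simplified sketch of the checksum -> pre-tokenizer-name mapping used during conversion.
# The checksum strings below are placeholders; the real values are computed by
# convert-hf-to-gguf.py from its own reference text.
from hashlib import sha256
from transformers import AutoTokenizer

def detect_pre_tokenizer(model_dir: str, reference_text: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    chkhsh = sha256(str(tokenizer.encode(reference_text)).encode()).hexdigest()

    if chkhsh == "<checksum observed for the DBRX tokenizer>":
        return "dbrx"
    if chkhsh == "<checksum observed for the Smaug tokenizer>":
        return "smaug-bpe"
    # Unknown tokenizer: a new hash/name pair has to be added explicitly.
    raise NotImplementedError(f"unknown pre-tokenizer, chkhsh = {chkhsh}")
```

If I understand the thread correctly, the resulting name is what selects the split regex on the llama.cpp side, which for Smaug is the same regex DBRX uses.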
@ggerganov I'm hoping I did this correctly:
It didn't run the Python tokenizer. Do you have a folder "./models/tokenizers/smaug-bpe"?
I ran it manually ahead of time because my models are in weird locations; the first line I posted is me manually running the Python tokenizer: python3 ./tests/test-tokenizer-0.py /models/Smaug-Llama-3-70B-Instruct/ --fname-tok /training_data/wikitext-2-raw/wiki.train.raw
The line in the test itself fails due to my weird model locations, but the diff would also fail if the file didn't exist; since I made it beforehand, it exists and matches.
Ah yes, that's correct. So this is good, but keep in mind there could be surprises with the tokenization in some edge cases.
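The kinds of edge cases meant here are things like repeated whitespace, control characters, emoji, and mixed scripts. One way to spot-check a few of them outside the test harness is to tokenize the same strings with the HF tokenizer and with the converted vocab and compare. The sketch below uses llama-cpp-python purely as an illustration of the idea, not something used in this thread, and the model paths are hypothetical.

```python
# Spot-check tricky strings against both tokenizers. Illustrative only; assumes
# llama-cpp-python is installed and that a vocab-only GGUF was produced earlier.
from transformers import AutoTokenizer
from llama_cpp import Llama

EDGE_CASES = [
    "  leading and   repeated   spaces",
    "tabs\tand\nnewlines",
    "emoji 🦙🦙 and accents café",
    "mixed scripts: こんにちは world Привет",
    "numbers 1234567890 and code: def f(x): return x*2",
]

hf = AutoTokenizer.from_pretrained("/models/Smaug-Llama-3-70B-Instruct/")      # hypothetical path
gg = Llama(model_path="models/ggml-vocab-smaug-bpe.gguf", vocab_only=True)      # hypothetical path

for s in EDGE_CASES:
    hf_ids = hf.encode(s, add_special_tokens=False)
    gg_ids = gg.tokenize(s.encode("utf-8"), add_bos=False, special=False)
    status = "OK" if hf_ids == list(gg_ids) else "MISMATCH"
    print(f"{status}: {s!r}")
```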
I'm doing some testing. There is one more thing to change, but I'm not sure how to solve this:

```diff
diff --git a/llama.cpp b/llama.cpp
index d26fe559..61db1ac1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -12566,7 +12570,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                 } break;
         case LLAMA_VOCAB_TYPE_BPE:
             {
-                if (add_special && vocab.special_add_bos != 0) {
+                if (add_special && vocab.special_add_bos == 1) {
                     GGML_ASSERT(vocab.special_bos_id != -1);
                     output.push_back(vocab.special_bos_id);
                 }
@@ -12585,7 +12589,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                     }
                 }

-                if (add_special && vocab.special_add_bos != 0 && output.size() >= 2 && output[1] == vocab.special_bos_id) {
+                if (add_special && vocab.special_add_bos == 0 && output.size() >= 2 && output[1] == vocab.special_bos_id) {
                     LLAMA_LOG_WARN(
                         "%s: Added a BOS token to the prompt as specified by the model but the prompt "
                         "also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. "
```

By default I guess we need to determine early what this undefined -1 should collapse to: 0 or 1.
@jaime-m-p any recommendations? Is this okay to merge in the interim, or should that issue be resolved first?
I did some stats. Not that exhaustive, but the average says that -1 should be treated as 0. I started another PR, #7530, trying to solve this among other issues with the BPE tokenizer.
Not sure what the protocol is, but I think if you commit, it will eventually work as expected once I finish that PR.
@ggerganov I think this is good to merge based on jaime's comment here: #7530 (comment)
Converted, quantized, and ran with no issues with these changes:
https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct