
Add Smaug 70B support to conversion #7402

Merged — 3 commits into ggerganov:master on May 26, 2024

Conversation

bartowski1182
Contributor

Converted, quantized, and ran with no issues using these changes:

https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct

@github-actions bot added the python (python script changes) label on May 20, 2024
@mofosyne added the model (Model specific), Review Complexity : Low (trivial changes that most beginner devs, or those who want a break, can tackle, e.g. a UI fix), and enhancement (New feature or request) labels on May 20, 2024
@bartowski1182
Contributor Author

Do I need to be at all concerned about the tests that got cancelled/skipped?

@ggerganov
Owner

I noticed that the pre-tokenizer config of this model is slightly different from DBRX. The regex is the same, but the other config params differ. We should make sure that they are compatible by running more extensive tokenization tests.

@bartowski1182
Contributor Author

Any tips on what can be run to accomplish this or where I can search myself for more info?

@ggerganov
Owner

ggerganov commented May 21, 2024

I usually create a vocab and then run tokenization on wikitext:

python3 convert-hf-to-gguf.py models/tokenizers/dbrx/ --outfile models/ggml-vocab-dbrx.gguf --vocab-only

./tests/test-tokenizer-0.sh dbrx ./build/wikitext-2-raw/wiki.train.raw

If all is good, it should output "Tokenization is correct!", which means that AutoTokenizer produced the same result as llama.cpp.

Though this test is not exhaustive, and even if it passes it does not guarantee correctness. @jaime-m-p has recently been making significant improvements to the tokenization tests, but I haven't studied them in detail yet: #7375
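
A minimal sketch of what the AutoTokenizer side of that comparison boils down to (the paths and the one-id-per-line output format are assumptions for illustration, not the exact contents of test-tokenizer-0.py):

# Tokenize the corpus with the reference HF tokenizer and dump the token ids;
# the shell script then diffs this against the ids produced by llama.cpp.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/tokenizers/dbrx/")

with open("./build/wikitext-2-raw/wiki.train.raw", encoding="utf-8") as f:
    text = f.read()

ids = tokenizer.encode(text, add_special_tokens=False)

with open("./build/wikitext-2-raw/wiki.train.raw.tok.py", "w") as f:
    f.write("\n".join(str(i) for i in ids) + "\n")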

@bartowski1182
Contributor Author

Okay, I'll give that a look!

The only reason I clumped it in with DBRX is that it seemed to need only the same regex support as DBRX; hopefully the tests will help show whether that's true.

@bartowski1182
Contributor Author

bartowski1182 commented May 21, 2024

@ggerganov I'm hoping I did this correctly:

python3 ./tests/test-tokenizer-0.py /models/Smaug-Llama-3-70B-Instruct/ --fname-tok /training_data/wikitext-2-raw/wiki.train.raw
python3 convert-hf-to-gguf.py --outfile ./models/ggml-vocab-smaug-bpe.gguf --vocab-only /models/Smaug-Llama-3-70B-Instruct/
./test-tokenizer-0.sh smaug-bpe /training_data/wikitext-2-raw/wiki.train.raw
make: 'tests/test-tokenizer-0' is up to date.
Testing smaug-bpe on /training_data/wikitext-2-raw/wiki.train.raw ...
main : tokenized in 13682.403 ms (cpp)
Tokenization is correct!

@ggerganov
Owner

It didn't run the Python tokenizer. Do you have a "./models/tokenizers/smaug-bpe" folder?

@bartowski1182
Contributor Author

I ran it manually ahead of time because my models are in weird locations

The first line I posted is me manually running the Python tokenizer:

python3 ./tests/test-tokenizer-0.py /models/Smaug-Llama-3-70B-Instruct/ --fname-tok /training_data/wikitext-2-raw/wiki.train.raw

@bartowski1182
Contributor Author

The line in the test itself fails due to my weird model locations, but the diff would also fail if the file didn't exist, so since I made it beforehand it exists and matches.

@ggerganov
Owner

Ah yes, that's correct. So this is good, but keep in mind there could be surprises with the tokenization in some edge cases.

@jaime-m-p
Collaborator

I'm doing some testing.

There is one more thing to change, but I'm not sure how to solve it.

diff --git a/llama.cpp b/llama.cpp
index d26fe559..61db1ac1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -12566,7 +12570,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
             } break;
         case LLAMA_VOCAB_TYPE_BPE:
             {
-                if (add_special && vocab.special_add_bos != 0) {
+                if (add_special && vocab.special_add_bos == 1) {
                     GGML_ASSERT(vocab.special_bos_id != -1);
                     output.push_back(vocab.special_bos_id);
                 }
@@ -12585,7 +12589,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                     }
                 }
 
-                if (add_special && vocab.special_add_bos != 0 && output.size() >= 2 && output[1] == vocab.special_bos_id) {
+                if (add_special && vocab.special_add_bos == 0 && output.size() >= 2 && output[1] == vocab.special_bos_id) {
                     LLAMA_LOG_WARN(
                         "%s: Added a BOS token to the prompt as specified by the model but the prompt "
                         "also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. "

By default, vocab.special_add_bos is -1 if it is not configured in *.json.
Neither smaug-bpe nor llama-bpe configures it, so in both cases it is -1.
If we assume -1 means True, llama-bpe works and smaug-bpe fails.
If we assume -1 means False, llama-bpe fails and smaug-bpe works.

I guess we need to determine early whether this undefined -1 should collapse to 0 or 1.
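
One hedged way to "determine it early" would be at conversion time, writing an explicit flag into the GGUF whenever tokenizer_config.json provides one, so the C++ side never sees -1 (a sketch only; the helper name and the True fallback are assumptions, not what this PR or #7530 does):

# Sketch: resolve add_bos_token while converting, so -1 never reaches llama.cpp.
import json
from pathlib import Path

def resolve_add_bos(model_dir: str, default: bool = True) -> bool:
    cfg_path = Path(model_dir) / "tokenizer_config.json"
    if cfg_path.is_file():
        cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
        if "add_bos_token" in cfg:
            return bool(cfg["add_bos_token"])
    return default  # collapse the unconfigured case to a concrete value

# e.g. in convert-hf-to-gguf.py: gguf_writer.add_add_bos_token(resolve_add_bos(dir_model))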

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 23, 2024
@bartowski1182
Contributor Author

@jaime-m-p any recommendations? Is this okay to merge in the interim, or should that issue be resolved first?

@jaime-m-p
Collaborator

I did some stats:

"gpt-2"           special_add_bos -1/0 FAIL   special_add_eos -1/0 OK
"llama-bpe"       special_add_bos -1/1 OK     special_add_eos -1/0 OK
"jina-v2-es"      special_add_bos  1/1 OK     special_add_eos  1/1 OK
"jina-v2-de"      special_add_bos  1/1 OK     special_add_eos  1/1 OK
"starcoder"       special_add_bos -1/0 FAIL   special_add_eos -1/0 OK
"deepseek-llm"    special_add_bos  1/1 OK     special_add_eos  0/0 OK
"deepseek-coder"  special_add_bos  1/1 OK     special_add_eos  0/0 OK
"falcon"          special_add_bos -1/0 FAIL   special_add_eos -1/0 OK

The a/b notation means: a is the current value, b is the expected value.
special_add_bos -1/0 is FAIL because the check -1 != 0 evaluates to True, while the expected value is 0 (False).
special_add_eos -1/0 is OK because the check -1 == 1 evaluates to False, matching the expected value 0 (False).

Not that exhaustive, but on average it says that -1 should be treated as 0.
It seems llama-bpe is the reversed case.
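
For illustration, the FAIL/OK pattern above follows from the two checks collapsing the unconfigured -1 in opposite directions (a Python paraphrase of the logic quoted in the diff and stats above, not code from llama.cpp):

# How the tristate {-1, 0, 1} currently collapses to a bool in each case:
def add_bos_current(special_add_bos: int) -> bool:
    return special_add_bos != 0   # unconfigured (-1) collapses to True

def add_eos_current(special_add_eos: int) -> bool:
    return special_add_eos == 1   # unconfigured (-1) collapses to False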

I started another PR, #7530, that tries to solve this among other issues with the BPE tokenizer.

> @jaime-m-p any recommendations? Is this okay to merge in the interim, or should that issue be resolved first?

Not sure what the protocol is, but I think if you commit, it will eventually work as expected once I finish the PR.

@mofosyne mofosyne mentioned this pull request May 25, 2024
@bartowski1182
Contributor Author

@ggerganov I think this is good to merge based on jaime's comment here: #7530 (comment)

@ggerganov ggerganov merged commit c429b33 into ggerganov:master May 26, 2024
71 of 74 checks passed