Add Smaug 70B support to conversion #7402
Conversation
Do I need to be at all concerned about the tests that got cancelled/skipped?
I noticed that the pre-tokenizer config of this model is slightly different from DBRX. The regex is the same, but the other config params differ. We should make sure that they are compatible by running more extensive tokenization tests.
Any tips on what I can run to accomplish this, or where I can look for more info?
I usually create a vocab and then run tokenization on wikitext:

```sh
python3 convert-hf-to-gguf.py models/tokenizers/dbrx/ --outfile models/ggml-vocab-dbrx.gguf --vocab-only
./tests/test-tokenizer-0.sh dbrx ./build/wikitext-2-raw/wiki.train.raw
```

If all is good, the test should report that the tokenizations match. Though this test is not exhaustive, and even if it passes it does not guarantee correctness. @jaime-m-p has recently been making significant improvements to the tokenization tests, but I haven't studied them in detail yet: #7375
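For context, here is a rough sketch of what the Python reference side of that comparison does. This is an illustrative approximation, not the actual tests/test-tokenizer-0.py; the `.tok` output naming and the paths are assumptions.

```python
# Illustrative sketch only: approximates the reference-tokenization step that the
# llama.cpp output is diffed against. The ".tok" suffix is an assumed convention.
from transformers import AutoTokenizer

def write_reference_tokens(model_dir: str, text_path: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    with open(text_path, encoding="utf-8") as f:
        text = f.read()
    ids = tokenizer.encode(text, add_special_tokens=False)
    # One token id per line, so the result can be diffed against llama.cpp's tokenization.
    with open(text_path + ".tok", "w", encoding="utf-8") as f:
        for tok in ids:
            f.write(f"{tok}\n")

if __name__ == "__main__":
    # Hypothetical paths, mirroring the commands above.
    write_reference_tokens("models/tokenizers/dbrx/", "./build/wikitext-2-raw/wiki.train.raw")
```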
Okay, I'll give that a look! The only reason I clumped it in with DBRX was that it seemed to only need the same regex support as DBRX; hopefully the tests will help show whether that's true.
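For readers unfamiliar with how a new BPE pre-tokenizer gets picked up during conversion: convert-hf-to-gguf.py identifies the pre-tokenizer by hashing the tokenizer's output on a fixed reference string and mapping that hash to a name such as "dbrx" or "smaug-bpe". The sketch below is a simplified illustration of that mapping; the checksum literals are placeholders, not the real values from the script.

```python
# Simplified sketch of the checksum -> pre-tokenizer-name mapping used during conversion.
# The checksum strings below are placeholders; the real values are computed by
# convert-hf-to-gguf.py from its own reference text.
from hashlib import sha256
from transformers import AutoTokenizer

def detect_pre_tokenizer(model_dir: str, reference_text: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    chkhsh = sha256(str(tokenizer.encode(reference_text)).encode()).hexdigest()

    if chkhsh == "<checksum observed for the DBRX tokenizer>":
        return "dbrx"
    if chkhsh == "<checksum observed for the Smaug tokenizer>":
        return "smaug-bpe"
    # Unknown tokenizer: a new hash/name pair has to be added explicitly.
    raise NotImplementedError(f"unknown pre-tokenizer, chkhsh = {chkhsh}")
```

If I understand the thread correctly, the resulting name is what selects the split regex on the llama.cpp side, which for Smaug is the same regex DBRX uses.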
@ggerganov I'm hoping I did this correctly:
It didn't run the Python tokenizer. Do you have a folder "./models/tokenizers/smaug-bpe"?
I ran it manually ahead of time because my models are in weird locations; the first line I posted is me manually running the Python tokenizer: python3 ./tests/test-tokenizer-0.py /models/Smaug-Llama-3-70B-Instruct/ --fname-tok /training_data/wikitext-2-raw/wiki.train.raw
The line in the test itself fails due to my weird model locations, but the diff would also fail if the file didn't exist; since I made it beforehand, it exists and matches.
Ah yes, that's correct. So this is good, but keep in mind there could be surprises with the tokenization in some edge cases.
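The kinds of edge cases meant here are things like repeated whitespace, control characters, emoji, and mixed scripts. One way to spot-check a few of them outside the test harness is to tokenize the same strings with the HF tokenizer and with the converted vocab and compare. The sketch below uses llama-cpp-python purely as an illustration of the idea, not something used in this thread, and the model paths are hypothetical.

```python
# Spot-check tricky strings against both tokenizers. Illustrative only; assumes
# llama-cpp-python is installed and that a vocab-only GGUF was produced earlier.
from transformers import AutoTokenizer
from llama_cpp import Llama

EDGE_CASES = [
    "  leading and   repeated   spaces",
    "tabs\tand\nnewlines",
    "emoji 🦙🦙 and accents café",
    "mixed scripts: こんにちは world Привет",
    "numbers 1234567890 and code: def f(x): return x*2",
]

hf = AutoTokenizer.from_pretrained("/models/Smaug-Llama-3-70B-Instruct/")      # hypothetical path
gg = Llama(model_path="models/ggml-vocab-smaug-bpe.gguf", vocab_only=True)      # hypothetical path

for s in EDGE_CASES:
    hf_ids = hf.encode(s, add_special_tokens=False)
    gg_ids = gg.tokenize(s.encode("utf-8"), add_bos=False, special=False)
    status = "OK" if hf_ids == list(gg_ids) else "MISMATCH"
    print(f"{status}: {s!r}")
```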
I'm doing some testing. There is one more thing to change, but I'm not sure how to solve this:

```diff
diff --git a/llama.cpp b/llama.cpp
index d26fe559..61db1ac1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -12566,7 +12570,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                 } break;
         case LLAMA_VOCAB_TYPE_BPE:
             {
-                if (add_special && vocab.special_add_bos != 0) {
+                if (add_special && vocab.special_add_bos == 1) {
                     GGML_ASSERT(vocab.special_bos_id != -1);
                     output.push_back(vocab.special_bos_id);
                 }
@@ -12585,7 +12589,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                     }
                 }

-                if (add_special && vocab.special_add_bos != 0 && output.size() >= 2 && output[1] == vocab.special_bos_id) {
+                if (add_special && vocab.special_add_bos == 0 && output.size() >= 2 && output[1] == vocab.special_bos_id) {
                     LLAMA_LOG_WARN(
                         "%s: Added a BOS token to the prompt as specified by the model but the prompt "
                         "also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. "
```

By default I guess we need to determine early what this undefined -1 should collapse to: 0 or 1.
@jaime-m-p any recommendations? Is this okay to merge in the interim, or should that issue be resolved first?
I did some stats. Not that exhaustive, but the average says that -1 should be treated as 0. I started another PR, #7530, trying to solve this among other issues with the BPE tokenizer.
Not sure what the protocol is, but I think if you commit, it will eventually work as expected once I finish that PR.
@ggerganov I think this is good to merge based on jaime's comment here: #7530 (comment)
Converted, quantized, and ran with no issues with these changes:
https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct