Tokenizer BPE fixes #7530
Conversation
If merged, don't forget to check #7402.
Would we ideally remake any quants from before this change? How impactful is it?
smaug-bpe works fine now.
Until now, none of these changes modify the GGUF processing. Note: this is only true for the BPE tokenizer fixes; for fixing the SPM tokenizer I had to modify the
Initialize values when loading the model vocab.
For now, only <mask> token, needed for 'jina-v2'.
@jaime-m-p would it be easier if I just opened a PR against your fork for Smaug and we get it all in at once? Or would you rather merge this and then get the Smaug changes in?
llama.cpp
Outdated
// tokenizer flags
bool tokenizer_add_space_prefix = true;
bool tokenizer_special_add_bos  = false;
bool tokenizer_special_add_eos  = false;
bool tokenizer_ignore_merges    = false;
bool tokenizer_mask_lstrip      = false;
Okay. Looks good. I'll see if I can target these member variables.
case LLAMA_VOCAB_PRE_TYPE_LLAMA3:
    regex_exprs = {
        // original regex from tokenizer.json
        //"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

        // adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
        "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
    };
    break;
This is fine for now because they're still required, but we shouldn't require these in the future; it will be error-prone. I'll be embedding these in the pre_tokenizer field instead so we can just read from them directly.
Yes, the variable regex_exprs should be stored and loaded like the other tokenizer 'flags'.
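A minimal sketch of what that could look like, assuming a hypothetical array-valued GGUF key named tokenizer.ggml.pretokenizer_regexes (the key name and the wrapper function are assumptions; only the gguf_* accessors are the existing ggml metadata API):

#include <string>
#include <vector>

#include "ggml.h" // gguf_* metadata accessors

// Hypothetical sketch: read the pre-tokenizer regexes from GGUF metadata the
// same way the boolean tokenizer flags are read, and fall back to the
// hard-coded per-model switch when the key is absent.
static std::vector<std::string> load_regex_exprs(const struct gguf_context * ctx) {
    std::vector<std::string> regex_exprs;
    const int kid = gguf_find_key(ctx, "tokenizer.ggml.pretokenizer_regexes"); // assumed key name
    if (kid < 0) {
        return regex_exprs; // not present: caller keeps the built-in defaults
    }
    const int n = gguf_get_arr_n(ctx, kid);
    for (int i = 0; i < n; ++i) {
        regex_exprs.push_back(gguf_get_arr_str(ctx, kid, i)); // copy each regex string
    }
    return regex_exprs;
}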
default:
    GGML_ASSERT(false);
    // default regex for BPE tokenization pre-processing
    regex_exprs = {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "\\p{N}+",
        "[0-9][0-9][0-9]",
    };
    break;
}
This isn't correct. The default should be the GPT-2 pre-tokenizer. Otherwise, it should use the defined pre-tokenizer.
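For reference, a minimal sketch of that suggestion (the wrapper function is illustrative; only the regex string is the GPT-2 pre-tokenizer expression already present in the default list above):

#include <string>
#include <vector>

// Sketch only: fall back to the GPT-2 pre-tokenizer regex when the
// pre-tokenizer type is unknown, instead of an ad-hoc regex list.
static std::vector<std::string> default_gpt2_regex_exprs() {
    return {
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    };
}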
llama.cpp
Outdated
@@ -12686,7 +12731,7 @@ static void tokenizer_st_partition(const llama_vocab & vocab, std::forward_list<

    // if a fragment is text ( not yet processed )
    if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT) {
-       auto * raw_text = &(fragment.raw_text);
+       auto & raw_text = fragment.raw_text;
Is this on the stack?
auto & raw_text = fragment.raw_text;
expands to
const std::string & raw_text = fragment.raw_text;
It's a reference, so it only takes 8 bytes on the stack (like a pointer).
In this case, you can read it as syntactic sugar for a const pointer to a const value:
std::string const * const raw_text = &(fragment.raw_text);
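A tiny self-contained illustration of that point (not from the PR; the struct and the string are made up): the reference and the pointer bind to the same member, and neither copies it.

#include <cassert>
#include <string>

struct fragment_t {
    std::string raw_text;
};

int main() {
    fragment_t fragment{"hello world"};

    const std::string & ref       = fragment.raw_text;    // reference form
    const std::string * const ptr = &(fragment.raw_text); // equivalent pointer form

    assert(&ref == ptr);          // both refer to the same object, no copy is made
    assert(ref == "hello world");
    return 0;
}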
# "llama-spm", # SPM | ||
# "phi-3", # SPM | ||
# "bert-bge", # WPM | ||
# "jina-v2-en", # WPM |
Yeah, I don't think these are needed. Maybe for WPM, but WPM (BERT) defaults to BPE (GPT) anyway. Still don't know enough about it yet.
Phi and Deepseek are BPE (GPT-2). I was able to get Phi 1, 1.5, and 2 all working correctly.
@bartowski1182 I think you can merge now; the issues are not in your added code.
@ggerganov The model The model
Maybe I will return to this later.
Thank you for the fixes - feel free to merge. Do you have ideas how to resolve the tokenization problems with added tokens? For example, with:
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-deepseek-coder.gguf
The reason AFAIK is that AutoTokenizer treats added tokens (such as
@ggerganov
llama.cpp/tests/test-tokenizer-0.cpp, lines 195 to 198 in 91c188d
llama.cpp/convert-hf-to-gguf-update.py, line 314 in 91c188d

llama.cpp/tests/test-tokenizer-0.cpp:
-    llama_tokenize(ctx, test_kv.first, add_special);
+    llama_tokenize(ctx, test_kv.first, add_special, true);

Tests passed.
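For context, the fourth argument is the parse_special flag of the llama_tokenize helper in common.h; a hedged sketch of opting in (the wrapper name is made up, and the helper signature is assumed to be (ctx, text, add_special, parse_special)):

#include <string>
#include <vector>

#include "common.h" // llama_tokenize std::string helper

// Sketch: with parse_special = true the tokenizer is allowed to match
// added/special tokens embedded in the input text, which is what
// AutoTokenizer does for these vocabularies.
static std::vector<llama_token> tokenize_with_special(
        llama_context * ctx, const std::string & text, bool add_special) {
    return llama_tokenize(ctx, text, add_special, /*parse_special=*/true);
}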
@@ -766,9 +772,16 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std

        bpe_offsets = unicode_regex_split_stl(text_collapsed, regex_expr_collapsed, bpe_offsets);
    } else {
        // no unicode category used, we can use std::wregex directly
        const std::wstring wtext = unicode_wstring_from_utf8(text);
Did you intend to remove this?
It's still here (line 778):
Lines 777 to 783 in 37bef89
// std::wregex \s does not mach non-ASCII whitespaces, using 0x0B as fallback
std::wstring wtext(cpts.begin(), cpts.end());
for (size_t i = 0; i < wtext.size(); ++i) {
    if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
        wtext[i] = 0x0B;
    }
}
Since unicode_wstring_from_utf8(text) is the same as const auto cpts = unicode_cpts_from_utf8(text), we just copy the unicode codepoints and then collapse the unicode whitespaces.
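A small standalone demo of why that fallback is needed (not from the PR): with the default locale, std::wregex \s typically does not match the non-ASCII no-break space U+00A0, while it does match the 0x0B byte that the code substitutes in.

#include <cstdio>
#include <regex>
#include <string>

int main() {
    const std::wregex ws(L"\\s");

    const std::wstring nbsp(1, wchar_t(0x00A0)); // U+00A0 no-break space (non-ASCII whitespace)
    const std::wstring vtab(1, wchar_t(0x000B)); // 0x0B vertical tab (ASCII whitespace)

    // on typical implementations this prints 0 and then 1
    std::printf("\\s matches U+00A0: %d\n", (int) std::regex_search(nbsp, ws));
    std::printf("\\s matches 0x0B  : %d\n", (int) std::regex_search(vtab, ws));
    return 0;
}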
Thanks for the reply. I didn't see that (between the lack of sleep, the heat, and a killer headache) 😅.
for (size_t i = 0; i < wtext.size(); ++i) {
    if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
        wtext[i] = 0x0B;
    }
}

//printf("text: %s\n", text.c_str());
//printf("regex_expr: %s\n", regex_expr.c_str());
bpe_offsets = unicode_regex_split_stl(wtext, wregex_expr, bpe_offsets);
wtext is still used here.
"Lo": CODEPOINT_FLAG_LETTER, # Other Letter | ||
"Lt": CODEPOINT_FLAG_LETTER, # Titlecase Letter | ||
"Lu": CODEPOINT_FLAG_LETTER, # Uppercase Letter | ||
"L&": CODEPOINT_FLAG_LETTER, # Cased Letter |
I cannot find any reference to L&. What is this?
I found it here: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt.
And the description here: https://www.regular-expressions.info/unicode.html:
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
That's why! The UnicodeData.txt file doesn't have or even define this.
12:58:44 | /mnt/valerie/public/gpt
(.venv) git:(main | Δ) λ python -i gen/test.py
done!
>>> categories = set()
>>> for cpt in codepoints:
... if cpt.general_category:
... categories.add(cpt.general_category)
...
>>> categories
{'Nl', 'Pf', 'Pd', 'Co', 'Cc', 'Lu', 'Me', 'Pe', 'Sk', 'Lt', 'Cf', 'Lm', 'Lo', 'Pi', 'Zp', 'Ll', 'No', 'Pc', 'Sm', 'Mn', 'Zl', 'Zs', 'Nd', 'Mc', 'So', 'Ps', 'Po', 'Cs', 'Sc'}
>>> "L&" in categories
False
L& is a regular expression property that's used as an alias.
The symbol "L&" indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase, or titlecase letter).
L&, as used in these comments, is an alias for the derived LC value (cased letter) of the General_Category property, as documented in PropertyValueAliases.txt.
https://www.unicode.org/reports/tr44/#Comments
This seems to be more applicable to the Perl-compatible regular expression property.
https://perldoc.perl.org/perluniprops
I don't know. Still learning. Possibly I'm missing plenty more.
Yeah, it will never be included:
# codepoint category flags
codepoint_flags[cpt] = UNICODE_CATEGORY_TO_FLAG[categ]
You would need to parse Special and Line Break to include special codepoint sets.
All L& unicode ranges in LineBreak.txt are already present in UnicodeData.txt as Ll, Lu, ...
Since it is never used, maybe we can just remove L& from the map UNICODE_CATEGORY_TO_FLAG.
Modifications to make BPE tokenizer match AutoTokenizer.
Tested with vocabs from models: