Tokenizer BPE fixes #7530

Merged 29 commits on Jun 18, 2024
Commits
e013b23
Update random test: add_bos_token
May 24, 2024
55e387b
Add BPE models for testing
May 24, 2024
fe3c531
bugfix: custom regex split fails with codepoint 0
May 25, 2024
6f4c300
Refactor llm_tokenizer_bpe: move code to constructor
May 25, 2024
614d0bb
Update random test: add_eos_token
May 25, 2024
6168399
Add BPE models for testing
May 25, 2024
0794b77
Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
May 25, 2024
51e933a
Fix falcon punctuation regex
May 25, 2024
1d2f3ad
Better name functions to append token/bos/eos
May 25, 2024
c83ea1a
Move tokenizer flags to vocab structure.
May 25, 2024
615f425
Allow lstrip for 'added_tokens'
May 25, 2024
f84b04f
Default values for special_add_bos/eos
May 25, 2024
7a5578f
Fix default value for WPM special_add_eos
May 25, 2024
173ab69
Better variable names
May 26, 2024
fef9915
Build vocab.special_tokens_cache using vocab token types
May 29, 2024
d67de1a
Merge commit '148995e5' into tokenizer-bpe-fixes
Jun 11, 2024
c863752
Generalize 'jina-v2' per token attributes
Jun 12, 2024
75840fe
Fix merge: 'smaug'
Jun 12, 2024
f58de31
update brute force random test
Jun 13, 2024
974d40b
Fix 'jina-v2' per token attributes
Jun 13, 2024
07530a8
Fix unicode whitespaces (deepseek-coder)
Jun 13, 2024
4ff15d4
Fix unicode whitespaces (deepseek-llm)
Jun 14, 2024
0575023
Skip missing byte tokens (falcon)
Jun 14, 2024
8cda5af
Update brute force random test
Jun 14, 2024
4af5478
Better unicode data generation
Jun 14, 2024
e28d0e4
Merge branch 'master' into tokenizer-bpe-fixes
Jun 15, 2024
903e47f
Fix merge: renamed and deleted files
Jun 15, 2024
b7ee827
Replace char32_t with uint32_t
Jun 17, 2024
b8929d5
Merge branch 'master' into tokenizer-bpe-fixes
jaime-m-p Jun 17, 2024