Enable Tokenizers with Byte Tokens #1153

Merged: 1 commit into dottxt-ai:main from fix-bpe, Sep 17, 2024

Conversation

@lapp0 (Contributor) commented Sep 14, 2024

Fixes #1038

Problem

Qwen directly uses bytes as tokens for "incomplete unicode". We don't handle type(token) == bytes in fsm.py.
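
To illustrate the failure mode (a hedged sketch; the byte values and pattern are made up, but the TypeError matches the linked issue):

```python
import re

# Hypothetical vocabulary entry: raw bytes rather than str, e.g. the first
# two bytes of a 3-byte UTF-8 character (values illustrative, not Qwen's).
token = b"\xe6\xb5"
assert isinstance(token, bytes)

# Code that assumes str tokens breaks on such entries, e.g.:
try:
    re.match("some-pattern", token)
except TypeError as e:
    print(e)  # cannot use a string pattern on a bytes-like object
```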

For "incomplete unicode" handling with llama and gpt2 style BPE tokenizers, fsm.regex.reduced_vocabulary() looks for a string prefix indicating it's incomplete unicode (e.g. \ufffd), converts the token string to bytes, and strips the prefix "incomplete unicode indicator" bytes to get the token_bytes.

Change

For this PR's "incomplete unicode" handling of "Qwen-style" BPE, the token is already bytes, so if isinstance(token_str, bytes): token_bytes = token.
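
Put together, the dispatch reads roughly like this (a hedged sketch reusing the illustrative INDICATOR above; variable names follow the description, not necessarily the merged diff):

```python
from typing import Union

INDICATOR = "\ufffd"  # illustrative indicator, as in the earlier sketch

def token_to_bytes(token: Union[str, bytes]) -> bytes:
    # This PR's addition: Qwen-style BPE already stores the token as raw
    # bytes, so it can be used directly.
    if isinstance(token, bytes):
        return token
    # Pre-existing llama/gpt2-style path: strip the indicator prefix.
    raw = token.encode("utf-8")
    if token.startswith(INDICATOR):
        return raw[len(INDICATOR.encode("utf-8")):]
    return raw
```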

@lapp0 changed the title from "enable byte-tokens" to "Enable Tokenizers with Byte Tokens" on Sep 14, 2024
@lapp0 force-pushed the fix-bpe branch 3 times, most recently from a4a0c6d to 0909276, on September 14, 2024 20:40
@lapp0 marked this pull request as ready for review on September 14, 2024 20:40
@rlouf merged commit 2b1aed0 into dottxt-ai:main on Sep 17, 2024
7 checks passed
Successfully merging this pull request may close these issues:

TypeError when calling glm-4-9b-chat (cannot use a string pattern on a bytes-like object)