Enable Tokenizers with Byte Tokens #1153

lapp0 · 2024-09-14T20:13:16Z

Problem

Qwen directly uses bytes as tokens for "incomplete unicode". We don't handle type(token) == byte in fsm.py.

For "incomplete unicode" handling with llama and gpt2 style BPE tokenizers, fsm.regex.reduced_vocabulary() looks for a string prefix indicating it's incomplete unicode (e.g. \ufffd), converts the token string to bytes, and strips the prefix "incomplete unicode indicator" bytes to get the token_bytes.

Change

For this PR's "incomplete unicode" handing for "Qwen-style" BPE, the token is already bytes, so if isinstance(token_str, bytes): token_bytes = token`.

lapp0 changed the title ~~enable byte-tokens~~ Enable Tokenizers with Byte Tokens Sep 14, 2024

lapp0 mentioned this pull request Sep 14, 2024

TypeError when calling glm-4-9b-chat (cannot use a string pattern on a bytes-like object) #1038

Closed

lapp0 force-pushed the fix-bpe branch 3 times, most recently from a4a0c6d to 0909276 Compare September 14, 2024 20:40

lapp0 marked this pull request as ready for review September 14, 2024 20:40

lapp0 mentioned this pull request Sep 15, 2024

generate.regex() fails to generate regex-constrained text with madlad400-3b-mt T5 model #1133

Open

enable actual-byte tokens in reduced_vocabulary

322f96f

lapp0 force-pushed the fix-bpe branch from 0909276 to 322f96f Compare September 16, 2024 18:19

rlouf merged commit 2b1aed0 into dottxt-ai:main Sep 17, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable Tokenizers with Byte Tokens #1153

Enable Tokenizers with Byte Tokens #1153

lapp0 commented Sep 14, 2024 •

edited

Loading

Enable Tokenizers with Byte Tokens #1153

Enable Tokenizers with Byte Tokens #1153

Conversation

lapp0 commented Sep 14, 2024 • edited Loading

Problem

Change

lapp0 commented Sep 14, 2024 •

edited

Loading