[WIP] Add a dedicated tokenizer for byte level transformers #36216

apehex · 2025-02-15T13:06:20Z

What does this PR do?

This is a simple implementation of a byte tokenizer.
It is useful for models like the recent BLT from Meta.

It accepts all the encoding schemes from the built-in encode function of the standard library.
In particular, "UTF-32-BE" is useful since it allows to use fixed size patching for the bytes.

The class doesn't use a vocabulary, but I still kept the usual methods for compatibility.

Worth noting is that the special tokens use the legacy codepoints built-in Unicode.
For example the mask token is "\u001a", which is the "substitute" codepoint.
See the Unicode table.

Fixes #36202

apehex · 2025-02-15T13:08:17Z

I've written tests but I'm not sure where I should place them.
Same goes for the actual implementation.

Where do you want them, if anywhere? @ArthurZucker

apehex added 3 commits February 14, 2025 21:02

Add "ByteTokenizer" for raw byte encodings

ee021ab

Allow custom tokens + detail the docstring + fix vocab size

a3be464

Detail the docstrings + add type signatures

5d8852f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add a dedicated tokenizer for byte level transformers #36216

[WIP] Add a dedicated tokenizer for byte level transformers #36216

apehex commented Feb 15, 2025

apehex commented Feb 15, 2025 •

edited

Loading

[WIP] Add a dedicated tokenizer for byte level transformers #36216

Are you sure you want to change the base?

[WIP] Add a dedicated tokenizer for byte level transformers #36216

Conversation

apehex commented Feb 15, 2025

What does this PR do?

apehex commented Feb 15, 2025 • edited Loading

apehex commented Feb 15, 2025 •

edited

Loading