Patch release: Fix FIM tokenizer

@patrickvonplaten patrickvonplaten released this 30 May 10:37

As reported in https://huggingface.co/mistralai/Codestral-22B-v0.1/discussions/10, the wrong tokenizer was used for FIM (fill-in-the-middle) requests. This patch release fixes that, so the following now works correctly:

from mistral_common.tokens.tokenizers.base import FIMRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the v3 tokenizer and encode a fill-in-the-middle request.
tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))

# The encoded text places the suffix before the prefix.
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("
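
The expected string in the assertion puts the suffix before the prefix. As a rough illustration of that layout only, here is a minimal sketch using a hypothetical helper, `fim_layout`, which is not part of mistral_common and does no real tokenization. It assumes spaces are displayed as ▁, and the leading ▁ before the prefix is copied from the expected string rather than derived from the tokenizer.

```python
# Hypothetical helper for illustration; not the mistral_common implementation.
def fim_layout(prompt: str, suffix: str) -> str:
    """Build the display text of a FIM request: <s>[SUFFIX]suffix[PREFIX]prefix."""
    marker = "\u2581"  # the "▁" character used to display spaces
    shown_suffix = suffix.replace(" ", marker)
    # Assumption: the prefix is shown with a leading marker, as in the release note.
    shown_prefix = marker + prompt.replace(" ", marker)
    return "<s>[SUFFIX]" + shown_suffix + "[PREFIX]" + shown_prefix


layout = fim_layout(prompt="def f(", suffix="return a + b")
```

With the prompt and suffix from the release note, `layout` matches the string in the assertion, which makes the suffix-first ordering easy to see without installing the library.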