Changed
- Only one progress bar is displayed while reading files during training. This is better for use cases with a large number of files, as it avoids cluttering the screen with too many progress bars. It also avoids reading the size of every file before actually reading them, which could take a very long time.
- [#193]: `encode` and `encode_batch` now take a new argument specifying whether the special tokens should be added (see the sketch after this list).
- [#197]: The `NormalizedString` has been removed from the `Encoding`. It can now be retrieved by calling `normalize` on the `Tokenizer`. This reduces the memory footprint by 70%.
- [#197]: The `NormalizedString` API has been improved. It is now possible to retrieve parts of both strings using either "normalized" or "original" offsets.
- [#197]: The offsets provided on the `Encoding` are now relative to the original string, not the normalized one.
- `AddedToken` is now used for both `add_special_tokens` and `add_tokens`, and offers more options to control how added tokens behave (also shown in the sketch below).
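
A minimal sketch of how these changes surface in the Rust API. It is illustrative, not taken from the changelog: it assumes a recent version of the `tokenizers` crate and a serialized tokenizer file (the `tokenizer.json` path is hypothetical), and exact signatures may differ between versions.

```rust
use tokenizers::{AddedToken, Tokenizer};

fn main() -> tokenizers::Result<()> {
    // Hypothetical file; any serialized tokenizer works here.
    let mut tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // `add_special_tokens` is now an explicit argument of `encode`.
    let input = "Hello there!";
    let with_special = tokenizer.encode(input, true)?;
    let without_special = tokenizer.encode(input, false)?;
    assert!(with_special.get_ids().len() >= without_special.get_ids().len());

    // Offsets now index into the ORIGINAL string, so slicing the raw
    // input with them is valid even when normalization changed it.
    for (&(start, end), token) in without_special
        .get_offsets()
        .iter()
        .zip(without_special.get_tokens())
    {
        println!("{:>10} -> {:?}", token, &input[start..end]);
    }

    // `AddedToken` now drives both `add_tokens` and `add_special_tokens`,
    // with flags such as `single_word` controlling how tokens are matched.
    tokenizer.add_tokens(&[AddedToken::from("[NEW]", false).single_word(true)]);
    tokenizer.add_special_tokens(&[AddedToken::from("[PAD]", true)]);
    Ok(())
}
```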
Added
- [#188]: `impl PostProcessor for ByteLevel`: handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespace in the produced offsets, even though that whitespace is part of the actual token.
- More alignment mappings on the `Encoding` (see the sketch after this list).
- `post_process` can now be called directly on the `Tokenizer`.
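
Again a hedged sketch rather than an excerpt from the library: the alignment helpers below (`char_to_token`, `token_to_chars`) exist on `Encoding` in recent versions of the crate, but their exact signatures (for example the extra sequence-id argument) have changed across releases.

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    let tokenizer = Tokenizer::from_file("tokenizer.json")?; // hypothetical path
    let encoding = tokenizer.encode("Hello there, world!", true)?;

    // Map a character position in the input to the token covering it
    // (the trailing `0` is the sequence id, required in recent versions).
    if let Some(token) = encoding.char_to_token(6, 0) {
        // ...and map that token back to its character span.
        if let Some((_seq, (start, end))) = encoding.token_to_chars(token) {
            println!("char 6 is inside token {token}, spanning chars {start}..{end}");
        }
    }

    // `post_process` is now public on the Tokenizer: it applies truncation,
    // the post-processor, and padding to an Encoding. Note that `encode`
    // already calls it, so this is mainly useful for custom pipelines.
    let processed = tokenizer.post_process(encoding, None, true)?;
    println!("{:?}", processed.get_tokens());
    Ok(())
}
```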
Fixed
- [#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
  - when `add_prefix_space` is activated
  - [#156]: when a Unicode character gets split into multiple byte-level characters
- Fix a bug where offsets were wrong when the encoded sequence contained added tokens.
- [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (adding that many is not advised, but it should not fail).
How to migrate
- Add the `ByteLevel` post-processor to your byte-level BPE tokenizers if relevant, as sketched below.
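
A possible migration, as a hedged sketch: it assumes a recent crate version and a hypothetical `byte-level-bpe.json` file, and depending on the version, `with_post_processor` may take the processor directly, a wrapper type, or an `Option`.

```rust
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Hypothetical path to an existing byte-level BPE tokenizer.
    let mut tokenizer = Tokenizer::from_file("byte-level-bpe.json")?;

    // ByteLevel now also implements PostProcessor; attaching it with
    // `trim_offsets` enabled removes the leading whitespace that
    // byte-level tokens (e.g. "Ġword") would otherwise add to offsets.
    // `.into()` converts into the tokenizer's post-processor wrapper type.
    tokenizer.with_post_processor(ByteLevel::default().trim_offsets(true).into());
    Ok(())
}
```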