Releases · huggingface/tokenizers
Python v0.9.0
Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add the `TemplateProcessing` `PostProcessor` (see the sketch after this list).
- [#420]: Ability to fuse the "unk" token in BPE.
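A minimal sketch of the new `TemplateProcessing` post-processor from [#403]; the BERT-style template and the special-token ids are illustrative placeholders, not values prescribed by the release notes:

```python
from tokenizers.processors import TemplateProcessing

# BERT-like template: "$A" is the first sequence, "$B" the second (pair) sequence.
# The ids (1, 2) are placeholders and must match the ids in your vocabulary.
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

# Attach it to an existing tokenizer (assumed to be trained/loaded elsewhere):
# tokenizer.post_processor = post_processor
```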
Changed
Python v0.9.0.rc1
Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add the `TemplateProcessing` `PostProcessor`.
- [#420]: Ability to fuse the "unk" token in BPE.
Changed
Python v0.8.1
Python v0.8.0
Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
  processing datasets that are already pre-tokenized, like for NER (Named Entity Recognition), and helps
  when applying labels to each word (see the sketch after this list).
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and to later
  load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
- With the serialization comes compatibility with `Pickle`! The Tokenizer, all of its components,
  Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
  makes heavy use of multithreading to provide very fast tokenization, this used to lead to deadlocks
  when combined with `multiprocessing`. This version now allows disabling the parallelism, and will
  warn you when that is necessary.
- And a lot of other improvements and fixes.
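A rough sketch of the serialization, pickling, and pre-tokenized encoding highlights above, assuming a tokenizer has already been trained and saved; the file names and input words are placeholders:

```python
import pickle

from tokenizers import Tokenizer

# Load a previously trained tokenizer from its single JSON file (placeholder path).
tokenizer = Tokenizer.from_file("my-tokenizer.json")

# Saving/sharing it is just as short: one line of code.
tokenizer.save("shared-tokenizer.json")

# The Tokenizer, all of its components, and Encodings can be pickled.
tokenizer = pickle.loads(pickle.dumps(tokenizer))

# Pre-tokenized input, e.g. words already split for NER labeling.
encoding = tokenizer.encode(["My", "name", "is", "John"], is_pretokenized=True)
print(encoding.tokens, encoding.offsets)
```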
Fixed
- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens
Added
- [#272]: Serialization of the `Tokenizer` and all its parts (`PreTokenizer`, `Normalizer`, ...).
  This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
  activation of the Tensor Cores by padding to a multiple of 8. Use with
  `enable_padding(pad_to_multiple_of=8)`, for example (see the sketch after this list).
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
  variable. This is especially useful when using `multiprocessing` with the `fork` start method,
  which happens to be the default on Linux systems. Without disabling the parallelism, the process
  deadlocks while encoding. (Cf. [#187] for more information.)
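A short sketch combining [#289], [#298], and [#311]; the environment-variable placement, file name, and batch contents are illustrative:

```python
import os

# [#311]: disable the internal parallelism, e.g. before forking worker processes
# with multiprocessing, to avoid the deadlock described above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

# Placeholder: assumes a previously saved tokenizer file.
tokenizer = Tokenizer.from_file("my-tokenizer.json")

# [#289]: pad every batch to a multiple of 8, so Tensor Cores can kick in.
tokenizer.enable_padding(pad_to_multiple_of=8)

# [#298]: inspect the currently set padding/truncation params (assumed to be
# exposed as properties here).
print(tokenizer.padding, tokenizer.truncation)

batch = tokenizer.encode_batch(["Hello world", "A slightly longer sentence"])
print([len(e.ids) for e in batch])  # lengths padded up to a multiple of 8
```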
Changed
- Improved errors generated during truncation: cases where the provided max length is too low are
  now handled properly.
- [#249]: `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is
  pre-tokenized, the argument `is_pretokenized=True` must be specified.
- [#276]: Improve BPE training speeds by reading files sequentially, but parallelizing the
  processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
  implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option
  `normalized`, controlling whether a token should be extracted from the normalized version of the
  input text (see the sketch after this list).
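A small sketch of the `normalized` option from [#309], assuming it is exposed as a keyword argument on `AddedToken`; the lowercasing normalizer and token content are illustrative:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()

# With normalized=False, "[MASK]" is matched against the original (un-normalized)
# text, so it is still found even though the normalizer lowercases everything else.
tokenizer.add_tokens([AddedToken("[MASK]", normalized=False)])
```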
Python v0.7.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with
  a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
  the size of each file before actually starting to read them, as this could take a really long time.
- [#193]: `encode` and `encode_batch` now take a new optional argument specifying whether we should
  add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by
  `encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- The added tokens given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
  `train(special_tokens=...)`, can now be instances of `AddedToken` to provide more control over
  these tokens.
- [#136]: Updated Pyo3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using
  constructors (see the sketch after this list).
- [#239]: `CharBPETokenizer` now corresponds to the OpenAI GPT BPE implementation by default.
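A sketch of the constructor change from [#136]; the vocab/merges paths are placeholders for existing files:

```python
from tokenizers.models import BPE

# Before: BPE.from_files("vocab.json", "merges.txt") and BPE.empty()
# Now the constructor takes the same arguments as the old static methods:
model = BPE("vocab.json", "merges.txt")
empty_model = BPE()
```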
Added
- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
  This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
  whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is
  off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char`
  or `word` (input space) and `token` (output space); see the sketch after this list.
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
  `get_vocab(with_added_tokens: bool)`
- [#136]: Models can now be instantiated through object constructors.
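A sketch of the alignment mappings from [#234] and `get_vocab` from [#208]; the tokenizer file is a placeholder, and the exact helper names shown on `Encoding` should be treated as assumptions:

```python
from tokenizers import Tokenizer

# Placeholder: assumes a previously saved tokenizer file.
tokenizer = Tokenizer.from_file("my-tokenizer.json")
encoding = tokenizer.encode("Hello world")

# Convert between input space (char/word) and output space (token).
print(encoding.char_to_token(1))   # token index covering character 1
print(encoding.token_to_chars(0))  # (start, end) character span of token 0
print(encoding.word_to_tokens(0))  # (start, end) token range of word 0

# Retrieve the vocabulary, with or without the added tokens.
vocab = tokenizer.get_vocab(with_added_tokens=True)
print(len(vocab))
```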
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space=True`
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if
  not advised, but that's not the question).
- [#205]: Trim the decoded string in `BPEDecoder`, used by `CharBPETokenizer`
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
  using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `BertWordPieceTokenizer` option `add_special_tokens` must now be given to `encode` or
  `encode_batch`.
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
  of `encode`, so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
  are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
  `normalize(sequence)` on the `Tokenizer`.
- Change `Model.from_files` and `Model.empty` to use the constructor. The model constructor should
  take the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`).
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set
  `bert_normalizer=False` and `split_on_whitespace_only=True` (see the sketch after this list).
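A sketch of the last migration step for `CharBPETokenizer`; the vocab/merges paths are placeholders for your existing files:

```python
from tokenizers import CharBPETokenizer

# Keep the pre-0.7.0 behavior (the class now matches the OpenAI GPT BPE
# implementation by default).
tokenizer = CharBPETokenizer(
    "vocab.json",
    "merges.txt",
    bert_normalizer=False,
    split_on_whitespace_only=True,
)
```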
Rust v0.10.1
Fixed
- [#226]: Fix the word indexes when there are special tokens
Rust v0.10.0
Rust v0.9.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with
  a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
  the size of each file before actually starting to read them, as this could take a really long time.
- [#190]: Improved BPE and WordPiece builders
- [#193]: `encode` and `encode_batch` now take a new argument, specifying whether we should add the
  special tokens
- [#197]: The `NormalizedString` has been removed from the `Encoding`. It is now possible to
  retrieve it by calling `normalize` on the `Tokenizer`. This brings a reduction of 70% of the
  memory footprint.
- [#197]: The `NormalizedString` API has been improved. It is now possible to retrieve parts of both
  strings using either the "normalized" or the "original" offsets.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- `AddedToken` is now used for both `add_special_tokens` and `add_tokens`. Also, these `AddedToken`s
  have more options to allow various behaviors.
Added
- [#188]: `impl PostProcessor for ByteLevel`: handles trimming the offsets if activated. This avoids
  the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are
  part of the actual token.
- More alignment mappings on the `Encoding`.
- `post_process` can be called on the `Tokenizer`
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space` is activated
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not
  advised, but that's not the question)
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant.