Releases · huggingface/tokenizers
Python v0.9.0
Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add the `TemplateProcessing` `PostProcessor` (see the sketch after this list).
- [#420]: Ability to fuse the "unk" token in BPE.
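A minimal sketch of the new `TemplateProcessing` post-processor from [#403]; the BERT-style template and the special-token ids are illustrative placeholders, not values prescribed by the release notes:

```python
from tokenizers.processors import TemplateProcessing

# BERT-like template: "$A" is the first sequence, "$B" the second (pair) sequence.
# The ids (1, 2) are placeholders and must match the ids in your vocabulary.
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

# Attach it to an existing tokenizer (assumed to be trained/loaded elsewhere):
# tokenizer.post_processor = post_processor
```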
Changed
Python v0.9.0.rc1
Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add the `TemplateProcessing` `PostProcessor`.
- [#420]: Ability to fuse the "unk" token in BPE.
Changed
Python v0.8.1
Python v0.8.0
Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
  processing datasets that are already pre-tokenized, like for NER (Named Entity Recognition), and helps
  when applying labels to each word (see the sketch after this list).
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and to later
  load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
- With the serialization comes compatibility with `Pickle`! The Tokenizer, all of its components,
  Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
  makes heavy use of multithreading to provide very fast tokenization, this used to lead to deadlocks
  when combined with `multiprocessing`. This version now allows disabling the parallelism, and will
  warn you when that is necessary.
- And a lot of other improvements and fixes.
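A rough sketch of the serialization, pickling, and pre-tokenized encoding highlights above, assuming a tokenizer has already been trained and saved; the file names and input words are placeholders:

```python
import pickle

from tokenizers import Tokenizer

# Load a previously trained tokenizer from its single JSON file (placeholder path).
tokenizer = Tokenizer.from_file("my-tokenizer.json")

# Saving/sharing it is just as short: one line of code.
tokenizer.save("shared-tokenizer.json")

# The Tokenizer, all of its components, and Encodings can be pickled.
tokenizer = pickle.loads(pickle.dumps(tokenizer))

# Pre-tokenized input, e.g. words already split for NER labeling.
encoding = tokenizer.encode(["My", "name", "is", "John"], is_pretokenized=True)
print(encoding.tokens, encoding.offsets)
```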
Fixed
- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens
Added
- [#272]: Serialization of the `Tokenizer` and all its parts (`PreTokenizer`, `Normalizer`, ...).
  This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
  activation of the Tensor Cores by padding to a multiple of 8. Use with
  `enable_padding(pad_to_multiple_of=8)`, for example (see the sketch after this list).
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
  variable. This is especially useful when using `multiprocessing` with the `fork` start method,
  which happens to be the default on Linux systems. Without disabling the parallelism, the process
  deadlocks while encoding. (Cf. [#187] for more information.)
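A short sketch combining [#289], [#298], and [#311]; the environment-variable placement, file name, and batch contents are illustrative:

```python
import os

# [#311]: disable the internal parallelism, e.g. before forking worker processes
# with multiprocessing, to avoid the deadlock described above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

# Placeholder: assumes a previously saved tokenizer file.
tokenizer = Tokenizer.from_file("my-tokenizer.json")

# [#289]: pad every batch to a multiple of 8, so Tensor Cores can kick in.
tokenizer.enable_padding(pad_to_multiple_of=8)

# [#298]: inspect the currently set padding/truncation params (assumed to be
# exposed as properties here).
print(tokenizer.padding, tokenizer.truncation)

batch = tokenizer.encode_batch(["Hello world", "A slightly longer sentence"])
print([len(e.ids) for e in batch])  # lengths padded up to a multiple of 8
```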
Changed
- Improved errors generated during truncation: cases where the provided max length is too low are
  now handled properly.
- [#249]: `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is
  pre-tokenized, the argument `is_pretokenized=True` must be specified.
- [#276]: Improve BPE training speeds by reading files sequentially, but parallelizing the
  processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
  implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option
  `normalized`, controlling whether a token should be extracted from the normalized version of the
  input text (see the sketch after this list).
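A small sketch of the `normalized` option from [#309], assuming it is exposed as a keyword argument on `AddedToken`; the lowercasing normalizer and token content are illustrative:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()

# With normalized=False, "[MASK]" is matched against the original (un-normalized)
# text, so it is still found even though the normalizer lowercases everything else.
tokenizer.add_tokens([AddedToken("[MASK]", normalized=False)])
```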
Python v0.7.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with
  a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
  the size of each file before actually starting to read them, as this could take a really long time.
- [#193]: `encode` and `encode_batch` now take a new optional argument specifying whether we should
  add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by
  `encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- The added tokens given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
  `train(special_tokens=...)`, can now be instances of `AddedToken` to provide more control over
  these tokens.
- [#136]: Updated Pyo3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using
  constructors (see the sketch after this list).
- [#239]: `CharBPETokenizer` now corresponds to the OpenAI GPT BPE implementation by default.
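A sketch of the constructor change from [#136]; the vocab/merges paths are placeholders for existing files:

```python
from tokenizers.models import BPE

# Before: BPE.from_files("vocab.json", "merges.txt") and BPE.empty()
# Now the constructor takes the same arguments as the old static methods:
model = BPE("vocab.json", "merges.txt")
empty_model = BPE()
```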
Added
- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
  This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
  whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is
  off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char`
  or `word` (input space) and `token` (output space); see the sketch after this list.
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
  `get_vocab(with_added_tokens: bool)`
- [#136]: Models can now be instantiated through object constructors.
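A sketch of the alignment mappings from [#234] and `get_vocab` from [#208]; the tokenizer file is a placeholder, and the exact helper names shown on `Encoding` should be treated as assumptions:

```python
from tokenizers import Tokenizer

# Placeholder: assumes a previously saved tokenizer file.
tokenizer = Tokenizer.from_file("my-tokenizer.json")
encoding = tokenizer.encode("Hello world")

# Convert between input space (char/word) and output space (token).
print(encoding.char_to_token(1))   # token index covering character 1
print(encoding.token_to_chars(0))  # (start, end) character span of token 0
print(encoding.word_to_tokens(0))  # (start, end) token range of word 0

# Retrieve the vocabulary, with or without the added tokens.
vocab = tokenizer.get_vocab(with_added_tokens=True)
print(len(vocab))
```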
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space=True`
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if
  not advised, but that's not the question).
- [#205]: Trim the decoded string in `BPEDecoder`, used by `CharBPETokenizer`
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
  using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `BertWordPieceTokenizer` option `add_special_tokens` must now be given to `encode` or
  `encode_batch`.
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
  of `encode`, so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
  are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
  `normalize(sequence)` on the `Tokenizer`.
- Change `Model.from_files` and `Model.empty` to use the constructor. The model constructor should
  take the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`).
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set
  `bert_normalizer=False` and `split_on_whitespace_only=True` (see the sketch after this list).
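A sketch of the last migration step for `CharBPETokenizer`; the vocab/merges paths are placeholders for your existing files:

```python
from tokenizers import CharBPETokenizer

# Keep the pre-0.7.0 behavior (the class now matches the OpenAI GPT BPE
# implementation by default).
tokenizer = CharBPETokenizer(
    "vocab.json",
    "merges.txt",
    bert_normalizer=False,
    split_on_whitespace_only=True,
)
```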
Rust v0.10.1
Fixed
- [#226]: Fix the word indexes when there are special tokens
Rust v0.10.0
Rust v0.9.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with
  a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
  the size of each file before actually starting to read them, as this could take a really long time.
- [#190]: Improved BPE and WordPiece builders
- [#193]: `encode` and `encode_batch` now take a new argument, specifying whether we should add the
  special tokens
- [#197]: The `NormalizedString` has been removed from the `Encoding`. It is now possible to
  retrieve it by calling `normalize` on the `Tokenizer`. This brings a reduction of 70% of the
  memory footprint.
- [#197]: The `NormalizedString` API has been improved. It is now possible to retrieve parts of both
  strings using either the "normalized" or the "original" offsets.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- `AddedToken` is now used for both `add_special_tokens` and `add_tokens`. Also, these `AddedToken`s
  have more options to allow various behaviors.
Added
- [#188]: `impl PostProcessor for ByteLevel`: handles trimming the offsets if activated. This avoids
  the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are
  part of the actual token.
- More alignment mappings on the `Encoding`.
- `post_process` can be called on the `Tokenizer`
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space` is activated
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not
  advised, but that's not the question)
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant.