Highlights of this release

- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
  processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and helps
  when applying labels to each word (see the sketch after this list).
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, to later
  load it back with just one line of code. That's what sharing a `Tokenizer` means now: 1 line of code.
- With the serialization comes compatibility with `Pickle`! The `Tokenizer`, all of its components,
  `Encoding`s, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
  makes heavy use of the multithreading capabilities of our computers to allow very fast tokenization,
  this led to problems (deadlocks) when used with `multiprocessing`. This version now makes it
  possible to disable the parallelism, and will warn you if this is necessary.
- And a lot of other improvements and fixes.
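To make the first highlight concrete, here is a minimal sketch; the `tokenizer.json` path is a hypothetical stand-in for any serialized tokenizer, and it assumes the word indices are exposed via `Encoding.words`:

```python
from tokenizers import Tokenizer

# "tokenizer.json" is a hypothetical path to any serialized tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Raw string input, as before
encoding = tokenizer.encode("My name is John")

# Pre-tokenized input: one string per word, flagged explicitly
encoding = tokenizer.encode(["My", "name", "is", "John"], is_pretokenized=True)

# Each token remembers which input word it came from, which makes it
# easy to propagate per-word labels (e.g. NER tags) to every sub-token
print(encoding.tokens)
print(encoding.words)
```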
Fixed
- #286: Fix various crashes when training a BPE model
- #309: Fixed a few bugs related to additional vocabulary/tokens
Added
- #272: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, …).
  This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`);
  see the serialization sketch after this list.
- #273: `Tokenizer` and its parts are now picklable (pickling sketch below)
- #289: Ability to pad to a multiple of a specified value. This is especially useful to ensure
  the activation of the Tensor Cores, which requires padding to a multiple of 8. Use with
  `enable_padding(pad_to_multiple_of=8)`, for example (padding sketch below).
- #298: Ability to get the currently set truncation/padding params (params sketch below)
- #311: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
  variable. This is especially useful when using `multiprocessing` capabilities with the `fork`
  start method, which happens to be the default on Linux systems. Without disabling the parallelism,
  the process dead-locks while encoding. (Cf #187 for more information; parallelism sketch below)
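A minimal sketch of the serialization round trip from #272; the empty `BPE()` model and the file name are placeholders for a real setup:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())  # placeholder: any tokenizer works here

# Save the entire pipeline (model, normalizer, pre-tokenizer, ...) as one JSON file
tokenizer.save("tokenizer.json")

# Sharing a Tokenizer is now 1 line of code
tokenizer = Tokenizer.from_file("tokenizer.json")

# The same round trip works with an in-memory JSON string
tokenizer = Tokenizer.from_str(tokenizer.to_str())
```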
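And the pickling from #273, under the same placeholder-tokenizer assumption:

```python
import pickle

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())  # placeholder: any tokenizer works here

# The Tokenizer, with all of its components, survives a full pickle
# round trip, which is what makes it usable with multiprocessing
restored = pickle.loads(pickle.dumps(tokenizer))
```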
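The padding sketch for #289; the file path and the pad id/token values are placeholders:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical path

# Pad each batch to the next multiple of 8 so that tensor shapes can
# activate the Tensor Cores on recent NVIDIA GPUs
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", pad_to_multiple_of=8)

encodings = tokenizer.encode_batch(["short", "a somewhat longer sentence"])
assert all(len(e.ids) % 8 == 0 for e in encodings)
```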
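For #298, a sketch assuming the current params are exposed as read-only `truncation` and `padding` properties that return a dict (or `None` when unset):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical path

tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(pad_to_multiple_of=8)

# Assumption: each property returns the currently set params as a dict
print(tokenizer.truncation)  # e.g. {'max_length': 128, ...}
print(tokenizer.padding)     # e.g. {'pad_to_multiple_of': 8, ...}
```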
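Finally, the parallelism sketch for #311. Setting the variable from Python before any encoding happens is one option; the path and texts are placeholders:

```python
import os

# Disable the Rust-side parallelism before the tokenizer is used
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from multiprocessing import Pool
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical path

def encode_ids(text):
    return tokenizer.encode(text).ids

if __name__ == "__main__":
    # "fork" is the default start method on Linux; without the variable
    # above, the forked workers could dead-lock while encoding
    with Pool(2) as pool:
        print(pool.map(encode_ids, ["first sentence", "second sentence"]))
```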
Changed
- Improved errors generated during truncation: cases where the provided max length is too low are
  now handled properly.
- #249: `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized,
  the argument `is_pretokenized=True` must be specified (see the sketch under the Highlights).
- #276: Improve BPE training speeds by reading files sequentially, but parallelizing the
  processing of each file
- #280: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
  implementation from GPT-2
- #309: Improved the management of the additional vocabulary. This introduces an option
  `normalized`, controlling whether a token should be extracted from the normalized version of the
  input text (see the sketch after this list).
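A sketch of the `normalized` option from #309, where the `<ent>` token and the file path are placeholders:

```python
from tokenizers import AddedToken, Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical path

# normalized=False: match "<ent>" against the raw input text;
# normalized=True would extract it from the normalized version instead
tokenizer.add_tokens([AddedToken("<ent>", normalized=False)])
```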