Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and helps
when applying labels to each word.
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, to later
load it back with just one line of code (see the example after this list). That’s what sharing a
Tokenizer means now: 1 line of code.
- With the serialization comes compatibility with Pickle! The Tokenizer, all of its components,
Encodings, everything can be pickled!
- Training a tokenizer is now even faster than before (up to 5-10x)!
- Compatibility with multiprocessing, even when using the fork start method. Since this library
makes heavy use of the multithreading capacities of our computers to allow very fast tokenization,
this used to cause problems (deadlocks) when combined with multiprocessing. This version now allows
disabling the parallelism, and will warn you when this is necessary.
- And a lot of other improvements and fixes.
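As a quick illustration of the serialization highlight, here is a minimal sketch (the tokenizer.json file name is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Build (or train) any tokenizer, then save it to a single JSON file
tokenizer = Tokenizer(BPE())
tokenizer.save("tokenizer.json")

# Sharing a Tokenizer now means this single line of code
tokenizer = Tokenizer.from_file("tokenizer.json")
```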
#286: Fix various crashes when training a BPE model
#309: Fixed a few bugs related to additional vocabulary/tokens
#272: Serialization of the Tokenizer and all of its parts. This adds some methods to easily
save/load an entire tokenizer. The Tokenizer and its parts are now picklable.
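A minimal sketch of the pickle support, reusing the same empty-BPE tokenizer idea:

```python
import pickle

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# The Tokenizer and all of its parts round-trip through pickle
restored = pickle.loads(pickle.dumps(tokenizer))
```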
#289: Ability to pad to a multiple of a specified value. This is especially useful to ensure the
activation of the Tensor Cores, by padding to a multiple of 8. Use
enable_padding(pad_to_multiple_of=8), for example.
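For example (the pad token and id below are placeholders for whatever your vocabulary actually uses):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Pad each batch up to a multiple of 8, keeping the resulting tensor
# shapes friendly to Tensor Cores
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", pad_to_multiple_of=8)
```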
#298: Ability to get the currently set truncation/padding params
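A sketch of reading these params back, assuming they are exposed as the read-only truncation and padding properties:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)

# Each property returns a dict of the current params, or None when unset
print(tokenizer.truncation)
print(tokenizer.padding)
```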
#311: Ability to enable/disable the parallelism using the TOKENIZERS_PARALLELISM environment
variable. This is especially useful when using multiprocessing capabilities with the fork
start method, which happens to be the default on Linux systems. Without disabling the parallelism,
the process dead-locks while encoding. (Cf [#187] for more information)
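For example, a minimal sketch of disabling the parallelism before encoding from forked workers (the file name is an example):

```python
import os

# Set this before the tokenizer does any parallel work, so that forked
# worker processes do not dead-lock while encoding
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import multiprocessing
from tokenizers import Tokenizer

def encode(text):
    tokenizer = Tokenizer.from_file("tokenizer.json")
    return tokenizer.encode(text).tokens

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(pool.map(encode, ["Hello world!", "How are you?"]))
```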
Improved errors generated during truncation: cases where the provided max length is too low are
now handled properly.
encode_batch now accepts pre-tokenized inputs. When the input is pre-tokenized,
is_pretokenized=True must be specified.
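A minimal sketch, useful for instance to keep NER labels aligned with words (the file name is an example):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Input that was already split into words, e.g. by an NER dataset
words = ["My", "name", "is", "John"]
encoding = tokenizer.encode(words, is_pretokenized=True)

# Batches work the same way
encodings = tokenizer.encode_batch([words, ["Hello", "there"]], is_pretokenized=True)
```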
#276: Improve BPE training speeds by reading files sequentially, but parallelizing the
processing of each file
Use onig for byte-level pre-tokenization to remove all the differences with the original
implementation from GPT-2
#309: Improved the management of the additional vocabulary. This introduces an option
normalized, controlling whether a token should be extracted from the normalized version of the
input text.
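For illustration, assuming the option is exposed through AddedToken (the token content here is hypothetical):

```python
from tokenizers import Tokenizer, AddedToken

tokenizer = Tokenizer.from_file("tokenizer.json")

# normalized=False: match this token against the raw input text instead
# of its normalized version
tokenizer.add_tokens([AddedToken("[MARKER]", normalized=False)])
```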