Tokenizers v0.8.0 is out!

Highlights of this release

  • We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
    processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and it
    helps when applying labels to each word.
  • Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and to
    load it back later with just one line of code. That’s what sharing a Tokenizer means now: 1 line
    of code. (See the short sketch after these highlights.)
  • With serialization comes compatibility with Pickle! The Tokenizer, all of its components,
    Encodings, everything can be pickled!
  • Training a tokenizer is now up to 5-10x faster than before!
  • Compatibility with multiprocessing, even when using the fork start method. Since this library
    makes heavy use of multithreading to provide very fast tokenization, this could lead to
    deadlocks when combined with multiprocessing. This version now allows disabling the parallelism,
    and will warn you when this is necessary.
  • And a lot of other improvements and fixes.
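For example, here is a minimal sketch of the new serialization and pickling support. It assumes
tokenizer is an already-built Tokenizer instance, and that saving goes through the save method
alongside the from_file loader mentioned in the changes below:

    import pickle
    from tokenizers import Tokenizer

    # Sketch, assuming `tokenizer` is an existing, trained Tokenizer instance.
    # Save the whole tokenizer (model, normalizer, pre-tokenizer, ...) as one JSON file:
    tokenizer.save("tokenizer.json")

    # Loading it back is the promised single line of code:
    tokenizer = Tokenizer.from_file("tokenizer.json")

    # Thanks to serialization, pickling the tokenizer (and its components) works too:
    restored = pickle.loads(pickle.dumps(tokenizer))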

Fixed

  • #286: Fixed various crashes when training a BPE model
  • #309: Fixed a few bugs related to additional vocabulary/tokens

Added

  • #272: Serialization of the Tokenizer and all of its parts (PreTokenizer, Normalizer, …).
    This adds methods to easily save/load an entire tokenizer (from_str, from_file).
  • #273: The Tokenizer and its parts are now picklable
  • #289: Ability to pad to a multiple of a specified value. This is especially useful for
    activating Tensor Cores, which requires padding to a multiple of 8. Use
    enable_padding(pad_to_multiple_of=8), for example (see the first sketch after this list).
  • #298: Ability to get the currently set truncation/padding params
  • #311: Ability to enable/disable the parallelism using the TOKENIZERS_PARALLELISM environment
    variable. This is especially useful when using the multiprocessing capabilities with the fork
    start method, which happens to be the default on Linux systems. Without disabling the
    parallelism, the process deadlocks while encoding (cf. #187 for more information). See the
    second sketch after this list.
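The padding addition in a short sketch (assuming an existing Tokenizer instance named tokenizer,
and assuming the current params are exposed as padding and truncation attributes per #298):

    # Pad each batch up to the next multiple of 8, so tensor dimensions stay
    # friendly to Tensor Cores:
    tokenizer.enable_padding(pad_to_multiple_of=8)

    # Read back the currently set params; None means the feature is disabled:
    print(tokenizer.padding)
    print(tokenizer.truncation)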
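And a sketch of opting out of parallelism before fork-based multiprocessing; the environment
variable is the one introduced by #311, the rest is illustrative:

    import os

    # Set this before the first parallel encode in the parent process; children
    # forked by multiprocessing would otherwise deadlock inside encode():
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

Setting the variable explicitly (to either true or false) should also silence the warning
mentioned in the highlights above.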

Changed

  • Improved errors generated during truncation: the case where the provided max length is too low
    is now handled properly.
  • #249: encode and encode_batch now accept pre-tokenized inputs. When the input is pre-tokenized,
    the argument is_pretokenized=True must be specified (see the sketch after this list).
  • #276: Improved BPE training speed by reading files sequentially, while parallelizing the
    processing of each file
  • #280: Use onig for byte-level pre-tokenization, removing all differences from the original
    GPT-2 implementation
  • #309: Improved the management of the additional vocabulary. This introduces an option
    normalized, controlling whether a token should be extracted from the normalized version of the
    input text.
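For example, a short sketch of pre-tokenized encoding (assuming tokenizer is an existing Tokenizer
instance; the words attribute, giving the word index of each token, is what makes label alignment
easy):

    # Raw string input, as before:
    encoding = tokenizer.encode("My name is John")

    # Pre-tokenized input, e.g. a NER dataset already split into words:
    encoding = tokenizer.encode(["My", "name", "is", "John"], is_pretokenized=True)
    print(encoding.tokens)
    print(encoding.words)  # word index of each token: align word-level NER labels with these

    # encode_batch accepts lists of pre-tokenized sequences the same way:
    batch = tokenizer.encode_batch(
        [["My", "name", "is", "John"], ["Hello", "world"]],
        is_pretokenized=True,
    )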