Tokenizers v0.8.0 is out!

Highlights of this release

  • We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
    processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and it
    helps when applying labels to each word.
  • Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and to
    load it back later with just one line of code. That’s what sharing a Tokenizer means now: 1 line
    of code. (See the short sketch after these highlights.)
  • With serialization comes compatibility with Pickle! The Tokenizer, all of its components,
    Encodings, everything can be pickled!
  • Training a tokenizer is now up to 5-10x faster than before!
  • Compatibility with multiprocessing, even when using the fork start method. Since this library
    makes heavy use of multithreading to provide very fast tokenization, this could lead to
    deadlocks when combined with multiprocessing. This version now allows disabling the parallelism,
    and will warn you when this is necessary.
  • And a lot of other improvements and fixes.
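For example, here is a minimal sketch of the new serialization and pickling support. It assumes
tokenizer is an already-built Tokenizer instance, and that saving goes through the save method
alongside the from_file loader mentioned in the changes below:

    import pickle
    from tokenizers import Tokenizer

    # Sketch, assuming `tokenizer` is an existing, trained Tokenizer instance.
    # Save the whole tokenizer (model, normalizer, pre-tokenizer, ...) as one JSON file:
    tokenizer.save("tokenizer.json")

    # Loading it back is the promised single line of code:
    tokenizer = Tokenizer.from_file("tokenizer.json")

    # Thanks to serialization, pickling the tokenizer (and its components) works too:
    restored = pickle.loads(pickle.dumps(tokenizer))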

Fixed

  • #286: Fixed various crashes when training a BPE model
  • #309: Fixed a few bugs related to additional vocabulary/tokens

Added

  • #272: Serialization of the Tokenizer and all of its parts (PreTokenizer, Normalizer, …).
    This adds methods to easily save/load an entire tokenizer (from_str, from_file).
  • #273: The Tokenizer and its parts are now picklable
  • #289: Ability to pad to a multiple of a specified value. This is especially useful for
    activating Tensor Cores, which requires padding to a multiple of 8. Use
    enable_padding(pad_to_multiple_of=8), for example (see the first sketch after this list).
  • #298: Ability to get the currently set truncation/padding params
  • #311: Ability to enable/disable the parallelism using the TOKENIZERS_PARALLELISM environment
    variable. This is especially useful when using the multiprocessing capabilities with the fork
    start method, which happens to be the default on Linux systems. Without disabling the
    parallelism, the process deadlocks while encoding (cf. #187 for more information). See the
    second sketch after this list.
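The padding addition in a short sketch (assuming an existing Tokenizer instance named tokenizer,
and assuming the current params are exposed as padding and truncation attributes per #298):

    # Pad each batch up to the next multiple of 8, so tensor dimensions stay
    # friendly to Tensor Cores:
    tokenizer.enable_padding(pad_to_multiple_of=8)

    # Read back the currently set params; None means the feature is disabled:
    print(tokenizer.padding)
    print(tokenizer.truncation)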
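And a sketch of opting out of parallelism before fork-based multiprocessing; the environment
variable is the one introduced by #311, the rest is illustrative:

    import os

    # Set this before the first parallel encode in the parent process; children
    # forked by multiprocessing would otherwise deadlock inside encode():
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

Setting the variable explicitly (to either true or false) should also silence the warning
mentioned in the highlights above.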

Changed

  • Improved errors generated during truncation: the case where the provided max length is too low
    is now handled properly.
  • #249: encode and encode_batch now accept pre-tokenized inputs. When the input is pre-tokenized,
    the argument is_pretokenized=True must be specified (see the sketch after this list).
  • #276: Improved BPE training speed by reading files sequentially, while parallelizing the
    processing of each file
  • #280: Use onig for byte-level pre-tokenization, removing all differences from the original
    GPT-2 implementation
  • #309: Improved the management of the additional vocabulary. This introduces an option
    normalized, controlling whether a token should be extracted from the normalized version of the
    input text.
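For example, a short sketch of pre-tokenized encoding (assuming tokenizer is an existing Tokenizer
instance; the words attribute, giving the word index of each token, is what makes label alignment
easy):

    # Raw string input, as before:
    encoding = tokenizer.encode("My name is John")

    # Pre-tokenized input, e.g. a NER dataset already split into words:
    encoding = tokenizer.encode(["My", "name", "is", "John"], is_pretokenized=True)
    print(encoding.tokens)
    print(encoding.words)  # word index of each token: align word-level NER labels with these

    # encode_batch accepts lists of pre-tokenized sequences the same way:
    batch = tokenizer.encode_batch(
        [["My", "name", "is", "John"], ["Hello", "world"]],
        is_pretokenized=True,
    )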