v3.X introduces a new API for all tokenizers. There are no breaking changes in the user-facing methods (`encode`, `encode_plus`, `batch_encode_plus`, `tokenize`, `convert_XXX`, `prepare_for_model`), but it introduces a new main entry point: `tokenizer.__call__`. You can discover the new API by browsing the tokenizers tutorial; this post is about how to migrate from the old API.

`__call__` is now the recommended way to encode all types of inputs whenever `tokenizer.encode` (which only returns the list of input indices for a single sentence) is not enough, i.e. for every case besides simple demo purposes.
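For instance, here is a minimal sketch of the new entry point (it assumes a transformers v3.x install and uses the publicly available `bert-base-uncased` checkpoint purely for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single sentences, sentence pairs, and batches all go through the same call:
encoded = tokenizer("Hello world")                          # one sentence
encoded_pair = tokenizer("Hello world", "How are you?")     # a sentence pair
encoded_batch = tokenizer(["Hello world", "How are you?"])  # a batch

# The result contains input_ids plus attention_mask/token_type_ids:
print(encoded["input_ids"])
```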
### Truncating and padding
The new API for padding and truncation uses three arguments to the encoding methods: `padding`, `truncation` and `max_length`. This new way to specify padding/truncation is available in all the user-facing encoding methods: `encode`, `encode_plus`, `batch_encode_plus` and the newly provided `__call__`.
All the previously provided ways to do padding/truncation (`truncation_strategy`, `max_length`, `pad_to_max_length`) are still supported without breaking changes, but we recommend using the new API.
Here are the details of all the possible inputs to `padding`, `truncation` and `max_length` (a short example follows the list):
- `padding` controls the padding (it can be a boolean or a string for finer-grained control) and accepts the following values:
  - `True` or `'longest'`: pad to the longest sequence in the batch (or no padding if only a single sequence is provided),
  - `'max_length'`: pad to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`),
  - `False` or `'do_not_pad'` (default): no padding (i.e. the output batch can have sequences of uneven lengths).
- `truncation` controls truncation (it can be a boolean or a string for finer-grained control) and accepts the following values:
  - `True` or `'only_first'`: truncate to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
  - `'only_second'`: truncate to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
  - `'longest_first'`: truncate to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair, if a pair of sequences (or a batch of pairs) is provided,
  - `False` or `'do_not_truncate'` (default): no truncation (i.e. the output batch can have sequences longer than the model's maximum admissible input size).
- `max_length` controls the target length for the padding/truncation (an integer or `None`) and accepts the following values:
  - `None` (default): use the model's predefined maximum length if required by one of the truncation/padding parameters. If the model has no specific maximum input length (e.g. XLNet), truncation/padding to a maximum length is deactivated,
  - any integer value (e.g. `42`): use this specific maximum length if required by one of the truncation/padding parameters.
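As a quick illustration of these three arguments, here is a sketch reusing the `tokenizer` loaded in the first example (the batch contents are made up for illustration):

```python
batch_sentences = ["Hello world", "A somewhat longer second sentence"]

# Pad to the longest sequence in the batch, no truncation:
batch = tokenizer(batch_sentences, padding=True)

# Pad to the model's maximum input length:
batch = tokenizer(batch_sentences, padding='max_length')

# Pad and truncate every sequence to exactly 42 tokens:
batch = tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)
```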
Now here are three tables summarizing the recommended way to set up `padding` and `truncation`, as well as the previously provided way to do it (still supported but not recommended), in all cases.
If you use pairs of input sequences in any of the following examples, you can replace `truncation=True` with a `STRATEGY` selected in `['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated, as detailed just before the tables (see the short sketch below). We don't include all these variants to keep the tables from getting too long.
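For example, with pairs of sequences a string strategy gives finer control than `truncation=True`; the question/context pairs below are hypothetical and only serve to illustrate the call (the `tokenizer` is the one loaded above):

```python
questions = ["What is padding?", "What is truncation?"]
contexts = [
    "Padding appends pad tokens so every sequence in a batch has equal length.",
    "Truncation removes tokens so a sequence fits the model's maximum length.",
]

# Truncate only the second sequence (the context) of each pair:
batch = tokenizer(questions, contexts, truncation='only_second', max_length=32)
```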
#### No truncation
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences)` | `tokenizer.batch_encode_plus(batch_sentences)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or `tokenizer(batch_sentences, padding='longest')` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True)` |
| padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | Not possible |
| padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | Not possible |
#### Truncation to max model input length
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences, truncation=True)` or `tokenizer(batch_sentences, truncation=STRATEGY)` | `tokenizer.batch_encode_plus(batch_sentences, max_length=tokenizer.max_len)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` | Not possible |
| padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=tokenizer.max_len)` |
| padding to specific length | Not possible | Not possible |
#### Truncation to specific length
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` | `tokenizer.batch_encode_plus(batch_sentences, max_length=42)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` | Not possible |
| padding to max model input length | Not possible | Not possible |
| padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=42)` |
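As a sanity check while migrating, the two calls from the last row above should produce identical encodings; here is a sketch reusing `tokenizer` and `batch_sentences` from the earlier examples (the old-style call may emit a deprecation warning):

```python
# Old API (still supported) vs. new API for the same padding/truncation setup:
old = tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=42)
new = tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)
assert old["input_ids"] == new["input_ids"]
```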