v3.x introduces a new API for all tokenizers. It makes no breaking changes to the user-facing methods (`encode`, `encode_plus`, `batch_encode_plus`, `tokenize`, `convert_XXX`, `prepare_for_model`) but introduces a new main entry point: `tokenizer.__call__`. You can explore the new API by browsing the tokenizers tutorial; this post is about how to migrate from the old API.
`__call__` is now the recommended way to encode all types of inputs whenever `tokenizer.encode` (which only returns the list of input indices for a single sentence) is not enough, i.e. in every case besides simple demo purposes.
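For example, here is what the move from `encode` to `__call__` looks like in practice (a minimal sketch; `bert-base-uncased` is just an arbitrary checkpoint picked for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# encode only returns the list of input ids for a single sentence
ids = tokenizer.encode("Hello, world!")

# __call__ returns a dict (input_ids, attention_mask, plus token_type_ids
# for models that use them) and accepts single sentences, pairs of
# sentences, or batches of either
single = tokenizer("Hello, world!")
batch = tokenizer(["Hello, world!", "A second, slightly longer sentence."])
```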
## Truncating and padding
The new API for padding and truncation uses three arguments to the encoding methods: `padding`, `truncation` and `max_length`. This new way of specifying padding/truncation is available in all the user-facing encoding methods: `encode`, `encode_plus`, `batch_encode_plus` and the newly provided `__call__`.
All the previously provided ways to do padding/truncation (`truncation_strategy`, `max_length`, `pad_to_max_length`) are still supported without breaking changes, but we recommend using the new API.
Here are the details of all the possible inputs to `padding`, `truncation` and `max_length`:
- `padding` controls the padding (can be provided with a boolean or a string for finer-grained control). `padding` accepts the following values:
  - `True` or `'longest'`: pad to the longest sequence in the batch (or no padding if only a single sequence is provided),
  - `'max_length'`: pad to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`),
  - `False` or `'do_not_pad'` (default): no padding (i.e. the output batch can have sequences of uneven lengths).
- `truncation` controls truncation (can be provided with a boolean or a string for finer-grained control). `truncation` accepts the following values:
  - `True` or `'only_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
  - `'only_second'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
  - `'longest_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided,
  - `False` or `'do_not_truncate'` (default): no truncation (i.e. the output batch can have sequences longer than the model's max admissible input size).
- `max_length` controls the length used by padding/truncation (integer or `None`). `max_length` accepts the following values:
  - `None` (default): use the predefined model max length if required by one of the truncation/padding parameters. If the model has no specific max input length (e.g. XLNet), truncation/padding to a max length is deactivated.
  - any integer value (e.g. `42`): use this specific maximum length if required by one of the truncation/padding parameters.
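To see how these three arguments combine, here is a small sketch (the checkpoint and the length `16` are arbitrary choices for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch_sentences = ["A short sentence.", "A noticeably longer sentence used for this example."]

# Pad to the longest sequence in the batch, no truncation
padded = tokenizer(batch_sentences, padding=True)

# Pad and truncate every sequence to exactly 16 tokens
fixed = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=16)
print([len(ids) for ids in fixed["input_ids"]])  # [16, 16]
```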
Now here are three tables summarizing the recommended ways to set up padding and truncation, as well as the previously provided ways to do it (still supported but not recommended), in all cases.
If you use pairs of input sequences in any of the following examples, you can replace `truncation=True` by a STRATEGY selected in `['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed just before the tables. We don't include all these variants to keep the tables from getting too long.
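For instance, on a batch of sentence pairs, a strategy string lets you choose which member of each pair loses tokens (a sketch; the checkpoint, sentences and `max_length` are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

questions = ["What is padding?", "What is truncation?"]
contexts = [
    "Padding adds special tokens so that all sequences in a batch share one length.",
    "Truncation removes tokens from a sequence so that it fits a maximum length.",
]

# Only the second sequence of each pair (the context) is truncated
encoded = tokenizer(questions, contexts, truncation="only_second", max_length=20)
```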
### No truncation
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences)` | `tokenizer.batch_encode_plus(batch_sentences)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or `tokenizer(batch_sentences, padding='longest')` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True)` |
| padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | Not possible |
| padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | Not possible |
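A sketch of the second row of this table as a migration in code (per the table, both calls pad to the longest sequence in the batch; the old one still runs on v3.x but is deprecated, and the checkpoint is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch_sentences = ["Short.", "A somewhat longer second sentence."]

# New API: pad to the longest sequence in the batch
new_style = tokenizer(batch_sentences, padding="longest")

# Old API: still supported, but not recommended
old_style = tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True)
```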
### Truncation to max model input length
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences, truncation=True)` or `tokenizer(batch_sentences, truncation=STRATEGY)` | `tokenizer.batch_encode_plus(batch_sentences, max_length=tokenizer.max_len)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` | Not possible |
| padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=tokenizer.max_len)` |
| padding to specific length | Not possible | Not possible |
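In code, truncation to the model max input length looks like this (a sketch; `bert-base-uncased` has a 512-token limit, other checkpoints differ, and `tokenizer.max_len` in the table above is the older name of the attribute exposed as `tokenizer.model_max_length` below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch_sentences = ["A short sentence.", "A very repetitive sentence. " * 200]

encoded = tokenizer(batch_sentences, padding=True, truncation=True)

# No sequence exceeds the model max input length (512 for this checkpoint)
assert all(len(ids) <= tokenizer.model_max_length for ids in encoded["input_ids"])
```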
### Truncation to specific length
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` | `tokenizer.batch_encode_plus(batch_sentences, max_length=42)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` | Not possible |
| padding to max model input length | Not possible | Not possible |
| padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=42)` |
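And finally, a sketch of the last row above (pad and truncate to a specific length, here the same `42` used in the table; the checkpoint is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch_sentences = ["First example sentence.", "Second, slightly longer example sentence."]

# New API: every sequence comes out at exactly 42 tokens
new_style = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=42)
assert all(len(ids) == 42 for ids in new_style["input_ids"])

# Old API: still supported on v3.x, but not recommended
old_style = tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=42)
```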