v3.X introduces a new API for all tokenizers. There are no breaking changes in the user-facing methods (`encode`, `encode_plus`, `batch_encode_plus`, `tokenize`, `convert_XXX`, `prepare_for_model`), but it introduces a new main entry point: `tokenizer.__call__`. You can discover the new API by browsing the tokenizers tutorial; this post is about how to migrate from the old API.

`__call__` is now the recommended way to encode all types of inputs whenever `tokenizer.encode` (which only returns the list of input indices for a single sentence) is not enough, i.e. for every case besides simple demo purposes.
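For instance, here is a minimal sketch of the new entry point (it assumes a transformers v3.x install and uses the publicly available `bert-base-uncased` checkpoint purely for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single sentences, sentence pairs, and batches all go through the same call:
encoded = tokenizer("Hello world")                          # one sentence
encoded_pair = tokenizer("Hello world", "How are you?")     # a sentence pair
encoded_batch = tokenizer(["Hello world", "How are you?"])  # a batch

# The result contains input_ids plus attention_mask/token_type_ids:
print(encoded["input_ids"])
```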
### Truncating and padding
The new API for padding and truncation uses three arguments to the encoding methods: `padding`, `truncation` and `max_length`. This new way to specify padding/truncation is available in all the user-facing encoding methods: `encode`, `encode_plus`, `batch_encode_plus` and the newly provided `__call__`.
All the previously provided ways to do padding/truncation (`truncation_strategy`, `max_length`, `pad_to_max_length`) are still supported without breaking changes, but we recommend using the new API.
Here are the details of all the possible inputs to `padding`, `truncation` and `max_length` (a short example follows the list):
- `padding` controls the padding (it can be a boolean or a string for finer-grained control) and accepts the following values:
  - `True` or `'longest'`: pad to the longest sequence in the batch (or no padding if only a single sequence is provided),
  - `'max_length'`: pad to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`),
  - `False` or `'do_not_pad'` (default): no padding (i.e. the output batch can have sequences of uneven lengths).
- `truncation` controls truncation (it can be a boolean or a string for finer-grained control) and accepts the following values:
  - `True` or `'only_first'`: truncate to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
  - `'only_second'`: truncate to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
  - `'longest_first'`: truncate to a length specified by `max_length`, or to the maximum acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair, if a pair of sequences (or a batch of pairs) is provided,
  - `False` or `'do_not_truncate'` (default): no truncation (i.e. the output batch can have sequences longer than the model's maximum admissible input size).
- `max_length` controls the target length for the padding/truncation (an integer or `None`) and accepts the following values:
  - `None` (default): use the model's predefined maximum length if required by one of the truncation/padding parameters. If the model has no specific maximum input length (e.g. XLNet), truncation/padding to a maximum length is deactivated,
  - any integer value (e.g. `42`): use this specific maximum length if required by one of the truncation/padding parameters.
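As a quick illustration of these three arguments, here is a sketch reusing the `tokenizer` loaded in the first example (the batch contents are made up for illustration):

```python
batch_sentences = ["Hello world", "A somewhat longer second sentence"]

# Pad to the longest sequence in the batch, no truncation:
batch = tokenizer(batch_sentences, padding=True)

# Pad to the model's maximum input length:
batch = tokenizer(batch_sentences, padding='max_length')

# Pad and truncate every sequence to exactly 42 tokens:
batch = tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)
```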
Now here are three tables summarizing the recommended way to set up `padding` and `truncation`, as well as the previously provided way to do it (still supported but not recommended), in all cases.
If you use pairs of input sequences in any of the following examples, you can replace `truncation=True` with a `STRATEGY` selected in `['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated, as detailed just before the tables (see the short sketch below). We don't include all these variants to keep the tables from getting too long.
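For example, with pairs of sequences a string strategy gives finer control than `truncation=True`; the question/context pairs below are hypothetical and only serve to illustrate the call (the `tokenizer` is the one loaded above):

```python
questions = ["What is padding?", "What is truncation?"]
contexts = [
    "Padding appends pad tokens so every sequence in a batch has equal length.",
    "Truncation removes tokens so a sequence fits the model's maximum length.",
]

# Truncate only the second sequence (the context) of each pair:
batch = tokenizer(questions, contexts, truncation='only_second', max_length=32)
```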
#### No truncation
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences)` | `tokenizer.batch_encode_plus(batch_sentences)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or `tokenizer(batch_sentences, padding='longest')` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True)` |
| padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | Not possible |
| padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | Not possible |
#### Truncation to max model input length
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences, truncation=True)` or `tokenizer(batch_sentences, truncation=STRATEGY)` | `tokenizer.batch_encode_plus(batch_sentences, max_length=tokenizer.max_len)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` | Not possible |
| padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=tokenizer.max_len)` |
| padding to specific length | Not possible | Not possible |
#### Truncation to specific length
| Padding | Recommended way | Previously provided (still supported but not recommended) |
|---|---|---|
| no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` | `tokenizer.batch_encode_plus(batch_sentences, max_length=42)` |
| padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` | Not possible |
| padding to max model input length | Not possible | Not possible |
| padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` | `tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=42)` |
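As a sanity check while migrating, the two calls from the last row above should produce identical encodings; here is a sketch reusing `tokenizer` and `batch_sentences` from the earlier examples (the old-style call may emit a deprecation warning):

```python
# Old API (still supported) vs. new API for the same padding/truncation setup:
old = tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=42)
new = tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)
assert old["input_ids"] == new["input_ids"]
```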