When using PreTrainedTokenizerFast directly and not one of its subclasses, you have to manually set all the attributes specific to Transformers: the model_max_length as well as all the special tokens. The reason is that the underlying Tokenizer has no concept of an associated model (so it doesn't know the model max length), and even though it has a concept of special tokens, it doesn't know the differences between them, so you have to indicate which one is the pad token, which one is the mask token, etc.
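As a minimal sketch of what that looks like (the file name tokenizer.json, the max length of 512, and the BERT-style token strings are just assumptions for illustration):

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load a tokenizer trained with the tokenizers library
# (the file name here is hypothetical).
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap it and set the Transformers-specific attributes manually:
# the max length, plus which string plays which special-token role.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    model_max_length=512,   # the Tokenizer itself doesn't know this
    unk_token="[UNK]",      # BERT-style special tokens, assumed here
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
```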