Cannot create an identical PreTrainedTokenizerFast object from a Tokenizer created by the tokenizers library

When you use PreTrainedTokenizerFast directly rather than one of its subclasses, you have to manually set all the attributes specific to Transformers: the model_max_length as well as all the special tokens. The reason is that the Tokenizer has no concept of an associated model (so it doesn’t know the model max length), and although it does have a concept of special tokens, it doesn’t know the differences between them, so you have to indicate which one is the pad token, which one is the mask token, etc.
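As a rough illustration of the point above, here is a minimal sketch of wrapping a Tokenizer and passing those attributes explicitly. The toy vocabulary, the `[PAD]`/`[UNK]`/`[CLS]`/`[SEP]`/`[MASK]` token strings, and the value 512 are all assumptions chosen for the example; substitute whatever your own tokenizer and model actually use.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary, purely illustrative (an assumption, not from the original post).
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "hello": 5, "world": 6,
}

# A bare Tokenizer from the tokenizers library: it knows nothing about
# a model max length or about which special token plays which role.
raw_tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
raw_tokenizer.pre_tokenizer = Whitespace()

# Wrap it and state the Transformers-specific attributes by hand.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    model_max_length=512,  # assumed value; use your model's real limit
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

print(fast_tokenizer.model_max_length)
print(fast_tokenizer.pad_token, fast_tokenizer.pad_token_id)
print(fast_tokenizer.mask_token, fast_tokenizer.mask_token_id)
```

If you leave these keyword arguments out, the wrapper still loads, but attributes such as pad_token stay unset, so anything that relies on them (padding, for instance) will not behave like the model-specific subclass does.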
