Cannot create an identical PreTrainedTokenizerFast object from a Tokenizer created by the tokenizers library

sgugger · August 30, 2021, 6:15pm

When you use PreTrainedTokenizerFast directly rather than one of its subclasses, you have to manually set all the attributes specific to Transformers: model_max_length as well as all the special tokens. The reason is that a Tokenizer has no concept of an associated model, so it doesn’t know the model max length. And although it does have a concept of special tokens, it doesn’t know the differences between them, so you have to indicate which one is the pad token, which one is the mask token, and so on.
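To make that concrete, here is a minimal sketch of wrapping a Tokenizer while passing the Transformers-specific attributes by hand. The tiny WordLevel vocabulary, the token strings such as [PAD] and [MASK], and the value 512 are assumptions chosen for illustration and are not from the original question; substitute whatever your own tokenizer was trained with.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary, standing in for a tokenizer you trained yourself.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "hello": 5, "world": 6}
raw_tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
raw_tokenizer.pre_tokenizer = Whitespace()

# The raw Tokenizer does not know which token plays which role, nor the
# model's maximum length, so both are passed explicitly here.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    model_max_length=512,  # assumed value; use your model's real limit
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# With pad_token set, padding a batch works as expected.
batch = fast_tokenizer(["hello world", "hello"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

If pad_token is left out, asking for padding=True fails because the wrapper has no padding token to use, which is the symptom people usually hit first.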
