How to configure TokenizerFast for AutoTokenizer

vblagoje · November 4, 2021, 12:08pm

Hi there,

I made a custom model and tokenizer for the RetriBERT architecture. For some reason, when I use the AutoTokenizer.from_pretrained method, the tokenizer does not initialize its model_max_length attribute to 512 but to a very large default integer. If I invoke AutoTokenizer.from_pretrained with an additional max_len=512 kwarg, model_max_length gets set to 512 as expected. However, as you might expect, I don’t want users to have to pass this additional kwarg and would prefer to set this value by default.

I figured out that TokenizerFast gets initialized from tokenizer.json, so I attempted to add a model_max_length attribute to tokenizer.json. However, as soon as I do that, AutoTokenizer complains that it can no longer load the JSON file. Perhaps this property can’t be set via tokenizer.json, or perhaps I am not adding it at the right JSON node.

Any ideas on how to set the model_max_length tokenizer property so that AutoTokenizer picks it up without additional kwargs?

Best,
Vladimir

vblagoje · November 4, 2021, 12:27pm

Hi again,

The model_max_length default seems to be initialized in the modelling class itself. See, for example, Retribert and how it gets set in the tokenizer for the official HF models, i.e. known models and their registered paths. Because the path of my model is not registered in the HF codebase, this value does not get set.

Which leaves me even more confused about how to set the model_max_length value.

Vladimir

vblagoje · November 11, 2021, 7:37am

Hi, I figured this one out; leaving a small note in case you stumble on this issue yourself. All you need to do is add a tokenizer_config.json file with additional configs for the tokenizer, next to tokenizer.json. I added a simple tokenizer_config.json with the following contents:
{"model_max_length": 512}

That’s all.
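
For anyone who prefers to script this step, here is a minimal sketch using only the Python standard library. The helper name set_model_max_length and the directory name in the usage comment are made up for illustration; the sketch assumes the directory already holds your tokenizer.json, and it keeps any keys that an existing tokenizer_config.json already contains.

```python
import json
from pathlib import Path


def set_model_max_length(model_dir, max_length=512):
    """Create or update tokenizer_config.json in model_dir (hypothetical helper)."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    # Start from the existing file, if any, so other settings are preserved.
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config["model_max_length"] = max_length
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Hypothetical usage; "my-retribert-model" is a placeholder directory name:
#
#   set_model_max_length("my-retribert-model")
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("my-retribert-model")
#   print(tokenizer.model_max_length)
```

After this, loading the directory with AutoTokenizer.from_pretrained should pick up the value without any extra kwargs, as described above.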

Cheers,
Vladimir
