How to configure TokenizerFast for AutoTokenizer

Hi there,

I made a custom model and tokenizer for the RetriBERT architecture. For some reason, when I load the tokenizer with AutoTokenizer.from_pretrained, its model_max_length attribute is not set to 512 but to a default, which is a very large integer. If I call AutoTokenizer.from_pretrained with an additional max_len=512 kwarg, model_max_length gets set to 512 as expected. However, I don't want users to have to pass this extra kwarg; I would prefer to set this value by default.

I figured out that TokenizerFast gets initialized from tokenizer.json, so I tried adding a model_max_length attribute to tokenizer.json. However, as soon as I do that, AutoTokenizer complains that it can no longer load the JSON file. Perhaps this property can't be set via tokenizer.json, or perhaps I am not adding it at the right JSON node.

Any ideas on how to set the model_max_length tokenizer property so that AutoTokenizer picks it up without additional kwargs?

Best,
Vladimir

Hi again,

The model_max_length default seems to be initialized in the model-specific tokenizer class itself. See, for example, RetriBERT and how the value gets set in the tokenizer for the official HF models, that is, known models and their registered paths. Because the path of my model is not registered in the HF codebase, this value does not get set.

Which leaves me even more confused about how to set the model_max_length value. :)

Vladimir

Hi, I figured this one out; leaving a small note in case you stumble on this issue yourself. All you need to do is add a tokenizer_config.json file, with the additional tokenizer settings, next to the other tokenizer files. I added a simple tokenizer_config.json with the following contents:
{"model_max_length": 512}

That’s all.
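In case anyone wants to script this step, here is a minimal sketch using only the Python standard library. The helper name set_model_max_length and the directory layout are my own assumptions, not part of transformers; it only writes (or updates) tokenizer_config.json in the folder that holds your tokenizer files.

```python
import json
from pathlib import Path


def set_model_max_length(model_dir, model_max_length=512):
    """Hypothetical helper: write or update tokenizer_config.json in model_dir.

    Assumes model_dir already exists and holds the tokenizer files
    (for example tokenizer.json). Existing keys in the config are kept.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"

    # Start from the existing config, if there is one, so nothing is lost.
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    # This is the key that fixed the issue described in this thread.
    config["model_max_length"] = model_max_length

    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_path
```

After running it on the model folder, AutoTokenizer.from_pretrained pointed at that folder should report model_max_length as 512 without any extra kwarg, as described above. I have only checked this behaviour with my own model, so treat it as a starting point.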

Cheers,
Vladimir