Hi there,
I made a custom model and tokenizer for Retribert architecture. For some reason, when using AutoTokenizer.from_pretrained method, the tokenizer does not initialize model_max_len tokenizer attribute to 512 but to a default of a very large integer. If I invoke AutoTokenizer.from_pretrained with an additional max_len=512 kwarg then the model_max_len gets set to 512 as expected. However, as you might expect I don’t want users to pass this additional kwarg but would prefer to somehow set this value by default.
I figured out that TokenizerFast gets initialized from tokenizer.json and I attempted to add model_max_len attribute to tokenizer.json. However, as soon as I do that AutoTokenizer complains that it can not load the JSON file any longer. Perhaps this property can’t be set via tokenizer.json or perhaps I am not adding it at the right JSON node.
Any ideas on how to set model_max_len tokenizer property so that AutoTokenizer picks it up without additional kwargs?
Best,
Vladimir