[ EDIT ]: there is a bug in transformers 4.11.0. Going back to 4.9.2 solves the issue described here (but it may create others, like this one: ByT5 tokenizer gives indices of chars instead of bytes?).
I’ve created a Colab notebook to show a problem that occurs when using google/byt5-small from the Hugging Face model hub with model.generate().
More specifically, the problem comes from the method tokenizer.convert_tokens_to_string() in the source code of transformers.models.byt5.tokenization_byt5.
The same problem happens with google/byt5-base.
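For anyone trying to reproduce this, here is a minimal sketch of the byte-level round trip I would expect from decoding. It is not the library's code and does not import transformers. The names `encode`, `decode` and `OFFSET` are hypothetical, and the offset of 3 (ids 0, 1, 2 reserved for special tokens) is an assumption made for illustration.

```python
from typing import List

# Sketch only: mimics byte-level tokenization without importing transformers.
# OFFSET = 3 is an assumption (ids 0, 1, 2 reserved for special tokens).
OFFSET = 3


def encode(text: str) -> List[int]:
    """Map each UTF-8 byte of `text` to an id shifted by OFFSET."""
    return [b + OFFSET for b in text.encode("utf-8")]


def decode(ids: List[int]) -> str:
    """Drop ids outside the byte range, shift back to bytes, decode as UTF-8."""
    raw = bytes(i - OFFSET for i in ids if OFFSET <= i < OFFSET + 256)
    # errors="ignore" skips incomplete byte sequences instead of raising.
    return raw.decode("utf-8", errors="ignore")


text = "café"
ids = encode(text)
print(len(text), len(ids))  # number of characters vs. number of bytes
print(decode(ids + [1]))    # a trailing special id is dropped before decoding
```

The point of the sketch is that a non-ASCII character such as "é" takes more than one byte, so a decoder has to join the bytes first and only then decode them as UTF-8; treating each token as a whole character does not round-trip.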
If someone could run my notebook and tell me what I did wrong or what a solution could be, I would appreciate it. Besides preventing the use of ByT5 for inference, this problem also prevents fine-tuning: when the model is evaluated at the end of an epoch, the script … calls tokenizer.convert_tokens_to_string(), which suddenly fails. Thanks.
cc @patrickvonplaten, @valhalla, @sshleifer