[ EDIT ]: there is a bug in transformers 4.11.0. Going back to 4.9.2 solves the issue described here (but it may create others, like this one: "ByT5 tokenizer gives indices of chars instead of bytes?").
Hi.
I’ve created a Colab notebook to show a problem that occurs when using google/byt5-small from the Hugging Face model hub together with model.generate().
Observations:
- More specifically, the problem comes from the method tokenizer.convert_tokens_to_string() in the source code of transformers.models.byt5.tokenization_byt5.
- The same problem happens with google/byt5-base.
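To make clear what I expect from the decoding step, here is a minimal pure-Python sketch of a byte-level round trip. It assumes ByT5's convention that ids 0 to 2 are the special tokens (pad, eos, unk) and that each UTF-8 byte b maps to id b + 3. The helper names are hypothetical and this is not the transformers implementation, only an illustration of the behaviour I would expect.

```python
from typing import List

# Assumption: ByT5 reserves ids 0, 1, 2 for <pad>, </s>, <unk>,
# so a raw UTF-8 byte value b is represented by id b + 3.
NUM_SPECIAL_TOKENS = 3
EOS_ID = 1


def encode_bytes(text: str) -> List[int]:
    """Hypothetical helper: map text to byte-level ids and append EOS."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")] + [EOS_ID]


def decode_ids(ids: List[int]) -> str:
    """Hypothetical helper: map byte-level ids back to text.

    Special tokens and ids outside the byte range are skipped, and
    incomplete UTF-8 sequences are ignored instead of raising an error.
    """
    raw = bytes(
        i - NUM_SPECIAL_TOKENS
        for i in ids
        if NUM_SPECIAL_TOKENS <= i < NUM_SPECIAL_TOKENS + 256
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = encode_bytes("héllo")
    print(ids)
    print(decode_ids(ids))
```

The point is that decoding should operate on bytes, not on characters: a non-ASCII character such as "é" spans two ids, and both are needed to rebuild it.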
If someone could run my notebook and tell me what I did wrong or what a solution could be, I would appreciate it. Besides preventing the use of ByT5 for inference, this problem also prevents fine-tuning: when the model is evaluated at the end of an epoch, the method tokenizer.convert_tokens_to_string() is called by the script … (which then fails). Thanks.
cc @patrickvonplaten, @valhalla, @sshleifer