ByT5: problem with tokenizer.decode()

[ EDIT ]: there is a bug in transformers 4.11.0. Going back to 4.9.2 solves the issue reported here (but it may create others, like this one: ByT5 tokenizer gives indices of chars instead of bytes?).


I’ve created a Colab notebook to show a problem when using google/byt5-small from the Hugging Face model hub with model.generate().


If someone could run my notebook and tell me what I did wrong, or what a solution could be, I would appreciate it. Besides preventing the use of ByT5 at inference, this problem also prevents fine-tuning it: when the model is evaluated at the end of an epoch, the method tokenizer.convert_tokens_to_string() is called by the script … and suddenly fails. Thanks.

cc @patrickvonplaten, @valhalla, @sshleifer

Screen shots from the notebook

I just discussed this on Twitter with Nick Doiron, who published the notebook ByT5-Finetuning-Datasets.ipynb.

He wrote the following tweet about the issue shown in my notebook:

It must be due to recent changes - transformers==4.9.2 works. decode got broken
The issue is because instead of 256 byte tokens, there are 3 special bytes (total token ids 0-258) and it isn’t accounting for those
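To illustrate the point in the tweet, here is a minimal sketch of the byte-to-token-id mapping, assuming (as the tweet says) that the 3 special tokens occupy ids 0-2, so a raw byte b maps to token id b + 3 and the vocabulary spans ids 0-258. The function names `encode_bytes`/`decode_ids` are hypothetical, not the real tokenizer API; a decode that forgets to subtract the offset of 3 produces the garbled output seen in the notebook.

```python
# Sketch of ByT5-style byte tokenization (assumption: 3 special
# tokens at ids 0-2, so byte b <-> token id b + 3).
NUM_SPECIAL = 3

def encode_bytes(text: str) -> list[int]:
    # UTF-8 bytes shifted past the special-token ids.
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]

def decode_ids(ids: list[int]) -> str:
    # Drop special-token ids, undo the offset, then decode as UTF-8.
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="ignore")

print(decode_ids(encode_bytes("héllo")))  # -> héllo
```

The round trip only works because `decode_ids` accounts for the special-token offset; mapping token ids straight to bytes shifts every byte by 3 and breaks multi-byte UTF-8 characters.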

My notebook has been updated with the following code and now it works 🙂

!pip install transformers==4.9.2

I guess the HF team will fix this issue in the next transformers version.

Issue opened on GitHub: ByT5: problem with tokenizer.decode() #13779

@patrickvonplaten closed this issue (see his explanation) by restoring errors="ignore" in decode("utf-8", errors="ignore") (see the commit).
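The errors="ignore" part matters because model.generate() can emit byte sequences that are not valid UTF-8 (for example a truncated multi-byte character), and a strict decode raises. A small self-contained illustration, independent of transformers:

```python
# Why errors="ignore" is needed: a generated byte sequence may cut a
# multi-byte UTF-8 character in half, which a strict decode rejects.
raw = "é".encode("utf-8")[:1]  # only the first byte of a 2-byte char

try:
    raw.decode("utf-8")  # strict decoding
except UnicodeDecodeError:
    print("strict decode fails on the partial character")

# With errors="ignore", the invalid byte is silently dropped instead.
print(repr(raw.decode("utf-8", errors="ignore")))  # -> ''
```

So with the fix restored, decoding never crashes on ill-formed generations; it simply drops the undecodable bytes.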