ByT5: problem with tokenizer.decode()

[EDIT]: there is a bug in version 4.11.0. Going back to 4.9.2 solves the issue reported here (but can it create others, like this one: ByT5 tokenizer gives indices of chars instead of bytes?)


Hi.

I’ve created a Colab notebook to show a problem when using google/byt5-small from the Hugging Face model hub with model.generate().

Observations:

If someone could run my notebook and tell me what I did wrong, or what a solution could be, I would appreciate it. Besides preventing inference with ByT5, this problem also prevents fine-tuning, since when the model is evaluated at the end of an epoch, the script … calls tokenizer.convert_tokens_to_string(), which then fails. Thanks.

cc @patrickvonplaten, @valhalla, @sshleifer

Screenshots from the notebook

I just discussed this on Twitter with Nick Doiron, who published the notebook ByT5-Finetuning-Datasets.ipynb.

He wrote the following tweet about the issue shown in my notebook:

It must be due to recent changes - transformers==4.9.2 works. decode got broken
The issue is because instead of 256 byte tokens, there are 3 special bytes (total token ids 0-258) and it isn’t accounting for those
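To make the offset in the tweet concrete, here is a minimal sketch in plain Python (no transformers dependency; the constant and function names are mine, not from the library). ByT5 reserves 3 special token ids ahead of the byte values, so a raw byte b maps to token id b + 3, giving ids 0–258 in total:

```python
# ByT5 reserves 3 special token ids (pad=0, eos=1, unk=2), so a raw
# UTF-8 byte b becomes token id b + 3 (token ids 0-258 overall).
SPECIAL_TOKENS = 3  # hypothetical constant name, not a transformers API

def bytes_to_token_ids(text: str) -> list[int]:
    """Encode text as ByT5-style token ids: UTF-8 bytes shifted by the offset."""
    return [b + SPECIAL_TOKENS for b in text.encode("utf-8")]

def token_ids_to_text(ids: list[int]) -> str:
    """Decode ByT5-style token ids back to text, skipping the special ids."""
    raw = bytes(i - SPECIAL_TOKENS for i in ids if i >= SPECIAL_TOKENS)
    # errors="ignore" drops invalid UTF-8 sequences instead of raising.
    return raw.decode("utf-8", errors="ignore")

ids = bytes_to_token_ids("hé")   # UTF-8 bytes 0x68, 0xC3, 0xA9
print(ids)                        # [107, 198, 172]
print(token_ids_to_text(ids))     # hé
```

A decoder that forgets to subtract the 3 special ids (or to skip them) produces shifted bytes, which is exactly the kind of breakage described above.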

My notebook has been updated with the following code, and now it works :slight_smile:

!pip install transformers==4.9.2

I guess that the HF team will solve this issue in the next transformers version.

Issue opened on GitHub: ByT5: problem with tokenizer.decode() #13779

@patrickvonplaten closed this issue (see explanation) by bringing back errors="ignore" in decode("utf-8", errors="ignore") (see commit).
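The reason errors="ignore" matters: model.generate() can stop in the middle of a multi-byte UTF-8 character, so the collected byte string may not be valid UTF-8. A small plain-Python illustration (the byte string here is a made-up example of a truncated sequence):

```python
# A truncated multi-byte UTF-8 sequence: 0xC3 starts a 2-byte character
# but the second byte never arrives (e.g. generation stopped early).
bad = b"abc\xc3"

try:
    bad.decode("utf-8")  # strict mode raises on the incomplete sequence
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors="ignore" silently drops the invalid bytes instead of raising
print(bad.decode("utf-8", errors="ignore"))  # abc
```

With errors="ignore", tokenizer.decode() returns the recoverable text instead of crashing mid-evaluation.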