ByT5: problem with tokenizer.decode()

[EDIT]: there is a bug in version 4.11.0. Going back to 4.9.2 solves the issue reported here (but can it create others, like this one: ByT5 tokenizer gives indices of chars instead of bytes?)


Hi.

I’ve created a Colab notebook to show a problem when using google/byt5-small from the Hugging Face model hub with model.generate().

Observations:

If someone could run my notebook and tell me what I did wrong, or what a solution could be, I would appreciate it. Besides preventing inference with ByT5, this problem also prevents fine-tuning, since when the model is evaluated at the end of an epoch, the script … calls tokenizer.convert_tokens_to_string(), which then fails. Thanks.

cc @patrickvonplaten, @valhalla, @sshleifer

Screenshots from the notebook

I just discussed this on Twitter with Nick Doiron, who published the notebook ByT5-Finetuning-Datasets.ipynb.

He wrote the following tweet about the issue shown in my notebook:

It must be due to recent changes - transformers==4.9.2 works. decode got broken
The issue is because instead of 256 byte tokens, there are 3 special bytes (total token ids 0-258) and it isn’t accounting for those
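To make the offset in the tweet concrete, here is a minimal sketch in plain Python (no transformers dependency; the constant and function names are mine, not from the library). ByT5 reserves 3 special token ids ahead of the byte values, so a raw byte b maps to token id b + 3, giving ids 0–258 in total:

```python
# ByT5 reserves 3 special token ids (pad=0, eos=1, unk=2), so a raw
# UTF-8 byte b becomes token id b + 3 (token ids 0-258 overall).
SPECIAL_TOKENS = 3  # hypothetical constant name, not a transformers API

def bytes_to_token_ids(text: str) -> list[int]:
    """Encode text as ByT5-style token ids: UTF-8 bytes shifted by the offset."""
    return [b + SPECIAL_TOKENS for b in text.encode("utf-8")]

def token_ids_to_text(ids: list[int]) -> str:
    """Decode ByT5-style token ids back to text, skipping the special ids."""
    raw = bytes(i - SPECIAL_TOKENS for i in ids if i >= SPECIAL_TOKENS)
    # errors="ignore" drops invalid UTF-8 sequences instead of raising.
    return raw.decode("utf-8", errors="ignore")

ids = bytes_to_token_ids("hé")   # UTF-8 bytes 0x68, 0xC3, 0xA9
print(ids)                        # [107, 198, 172]
print(token_ids_to_text(ids))     # hé
```

A decoder that forgets to subtract the 3 special ids (or to skip them) produces shifted bytes, which is exactly the kind of breakage described above.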

My notebook has been updated with the following code, and now it works :slight_smile:

!pip install transformers==4.9.2

I guess that the HF team will solve this issue in the next transformers version.

Issue opened on GitHub: ByT5: problem with tokenizer.decode() #13779

@patrickvonplaten closed this issue (see explanation) by bringing back errors="ignore" in decode("utf-8", errors="ignore") (see commit).
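The reason errors="ignore" matters: model.generate() can stop in the middle of a multi-byte UTF-8 character, so the collected byte string may not be valid UTF-8. A small plain-Python illustration (the byte string here is a made-up example of a truncated sequence):

```python
# A truncated multi-byte UTF-8 sequence: 0xC3 starts a 2-byte character
# but the second byte never arrives (e.g. generation stopped early).
bad = b"abc\xc3"

try:
    bad.decode("utf-8")  # strict mode raises on the incomplete sequence
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors="ignore" silently drops the invalid bytes instead of raising
print(bad.decode("utf-8", errors="ignore"))  # abc
```

With errors="ignore", tokenizer.decode() returns the recoverable text instead of crashing mid-evaluation.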