2 possible bugs when adding new tokens to T5

Hi,

I think I found 2 issues while trying to add new tokens to the T5 tokenizer. My goal was to add the less-than sign "<" to the vocabulary of T5. However, doing this prevents the model from extracting the eos_token and unk_token correctly: when the model encounters <unk> or </s>, it splits them into "<" and "unk>" or "<" and "/s>". (I am appending the eos token to the end myself; it then gets extracted as a normal token and also shows up in the prediction even if skip_special_tokens=True.)
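
For reference, here is a minimal sketch of the splitting behavior I am describing (t5-small is just my example checkpoint):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

text = "a < b </s>"
# Before adding "<": "</s>" is recognized as a single special token
print(tokenizer.tokenize(text))

tokenizer.add_tokens(["<"])
# After adding "<": this is where I see "</s>" split into "<" and "/s>"
print(tokenizer.tokenize(text))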

So to work around this, I decided to change the eos_token to ~end~. This works, but then I observed that the prediction results were much worse, especially at the end of the prediction. Everything else was fine, but at the very end there was always some strange additional generation. I tracked the issue down and found the cause.

The custom eos_token is added to the vocabulary as a new token instead of overwriting the existing eos_token. "</s>" has id 1, but when I pass my own custom eos_token, it is not mapped to id 1; instead a new token is appended to the vocabulary, and all the pretraining for the eos token is lost. The model has to learn the eos_token from scratch, which degrades the results noticeably. If you run the code below and observe the output, you will see what I mean.

from transformers import T5Tokenizer

model_name = "t5-small"  # any T5 checkpoint works here

tokenizer = T5Tokenizer.from_pretrained(model_name)
print('len_tokenizer with default eos token: ', len(tokenizer))

# Passing a custom eos_token adds a NEW token instead of remapping id 1
tokenizer = T5Tokenizer.from_pretrained(model_name, eos_token='~end~')
print('len_tokenizer with custom eos token: ', len(tokenizer))

print('id 1 before adding <:', tokenizer.decode(1, skip_special_tokens=False))
print('id 2 before adding <:', tokenizer.decode(2, skip_special_tokens=False))

tokenizer.add_tokens(['{', '}', '<', '>', '\\'])

print('id 1 after adding <:', tokenizer.decode(1, skip_special_tokens=False))
print('id 2 after adding <:', tokenizer.decode(2, skip_special_tokens=False))

# The custom eos token gets a brand-new id at the end of the vocabulary,
# so the pretrained embedding for "</s>" (id 1) is never used
custom_eos_id = tokenizer.encode("~end~", return_tensors='pt', truncation=True, padding=True)
print('custom_eos id: ', custom_eos_id)

Thank you for your time!

Hm, currently we don’t have a simple high-level way to change the string associated with a token.

You could manually edit the tokenizer.json file generated by the fast version of the T5 tokenizer, I guess, but that’s a bit hacky.
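
Something along these lines, as a rough sketch only (it assumes the tokenizer.json layout of current fast tokenizers, where special tokens sit in a top-level "added_tokens" list, and it reuses the "~end~" string from above; other sections, e.g. the post-processor template, may also reference "</s>" and need the same rename):

import json
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
tokenizer.save_pretrained("t5-custom-eos")

path = "t5-custom-eos/tokenizer.json"
with open(path) as f:
    data = json.load(f)

# Rename the existing eos token in place so "~end~" keeps id 1
# (assumption: "</s>" is listed under "added_tokens"; check your file)
for entry in data["added_tokens"]:
    if entry["content"] == "</s>":
        entry["content"] = "~end~"

with open(path, "w") as f:
    json.dump(data, f, ensure_ascii=False)

# Reload, telling the tokenizer that "~end~" is now the eos string
tokenizer = T5TokenizerFast.from_pretrained("t5-custom-eos", eos_token="~end~")
print(tokenizer.eos_token_id)  # should still be 1 if the rename worked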
