Two possible bugs when adding new tokens to T5


I think I found two issues while trying to add new tokens to the T5 tokenizer. My goal was to add the less-than sign “<” to the T5 vocabulary. However, doing this prevents the tokenizer from recognizing the eos_token and unk_token correctly. When it encounters <unk> or </s>, it splits them into < and unk> or < and /s>. (I append the eos token to the end of each sequence myself; after the change it is treated as ordinary tokens and also shows up in the prediction, even with skip_special_tokens=True.)

So to overcome this, I decided to change the eos_token to ~end~. This works, but then the prediction results were much worse, especially at the end of each prediction. Everything looked fine until the end, where there was always some strange additional generated text. I tracked down the cause.

The custom eos_token is added to the vocabulary as a new token instead of overwriting the existing eos_token. That is, “</s>” has id 1, and when I set my own custom eos_token, it neither replaces the entry at id 1 nor is mapped to id 1. It gets a new id in the vocabulary, so all the pretraining for the eos token is lost. The model has to learn the eos_token from scratch, and this degrades the results noticeably. If you run the code below and look at the output, you will see what I mean.

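from transformers import T5Tokenizer

model_name = "t5-small"  # assumption: the post does not say which T5 checkpoint was used

tokenizer = T5Tokenizer.from_pretrained(model_name)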

print('len_tokenizer with default eos token: ', len(tokenizer))

tokenizer = T5Tokenizer.from_pretrained(model_name, eos_token='~end~')

print('len_tokenizer with custom eos token: ', len(tokenizer))

print('id 1 before adding <:', tokenizer.decode(1, skip_special_tokens=False))

print('id 2 before adding <:', tokenizer.decode(2, skip_special_tokens=False))

tokenizer.add_tokens(['{', '}', '<', '>', '\\'])

print('id 1 after adding <:', tokenizer.decode(1, skip_special_tokens=False))

print('id 2 after adding <:', tokenizer.decode(2, skip_special_tokens=False))

custom_eos_id = tokenizer.encode("~end~", return_tensors='pt', truncation=True, padding=True)

print('custom_eos id: ', custom_eos_id)
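To make the effect concrete without downloading a model, here is a toy sketch of the behaviour I am describing. `ToyVocab` and `set_eos` are hypothetical names made up for illustration; this is not the transformers API, only a model of "a new eos string gets a fresh id instead of taking over id 1".

```python
class ToyVocab:
    """Hypothetical stand-in for a tokenizer vocabulary (not the transformers API)."""

    def __init__(self, pieces):
        # The position of a piece in the list is its token id.
        self.token_to_id = {piece: i for i, piece in enumerate(pieces)}

    def set_eos(self, token):
        # Mimics the reported behaviour: an unknown eos string is appended
        # with a new id; the old eos entry keeps its id and is not replaced.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]


vocab = ToyVocab(["<pad>", "</s>", "<unk>", "hello"])
old_eos_id = vocab.token_to_id["</s>"]  # id the pretrained model learned as eos
new_eos_id = vocab.set_eos("~end~")     # id the custom eos token ends up with

print("old eos id:", old_eos_id)
print("new eos id:", new_eos_id)
```

The two ids differ, so whatever the model learned about the token at the old id does not carry over to the new one.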

Thank you for your time!

Hm, currently we don’t have a simple high-level way to change the string associated with a token.

You could manually edit the tokenizer.json file generated by the fast version of the T5 tokenizer, I guess, but that’s a bit hacky.
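A rough sketch of what such a manual edit might look like, operating on the parsed JSON as a plain dict. The layout is an assumption, not something confirmed here: it expects `model.vocab` to be a list of [piece, score] pairs whose index is the token id, and `added_tokens` to be a list of dicts with a `content` field. `rename_token` is a hypothetical helper, and the `toy` dict below is made-up data, so check your own file before relying on any of this.

```python
import copy


def rename_token(tokenizer_json, old, new):
    """Return a copy of the parsed tokenizer.json with `old` renamed to `new`.

    The token keeps its position, and therefore its id, so the pretrained
    embedding for that id stays in use.
    """
    data = copy.deepcopy(tokenizer_json)
    # Assumed layout: model.vocab is a list of [piece, score] pairs.
    for entry in data.get("model", {}).get("vocab", []):
        if entry[0] == old:
            entry[0] = new
    # Assumed layout: added_tokens is a list of dicts with a "content" field.
    for entry in data.get("added_tokens", []):
        if entry.get("content") == old:
            entry["content"] = new
    return data


# Made-up miniature file contents, for illustration only.
toy = {
    "added_tokens": [{"id": 1, "content": "</s>"}],
    "model": {"vocab": [["<pad>", 0.0], ["</s>", 0.0], ["<unk>", 0.0]]},
}
patched = rename_token(toy, "</s>", "~end~")
print(patched["model"]["vocab"])
```

Other parts of the file or of the accompanying config files may also mention the old string (for example a post-processor section or the eos_token setting), and those would need the same change, which is part of why this route is hacky.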
