I am using the `add_tokens` function to enlarge the vocabulary of the pre-trained distilgpt2 tokenizer. But when I do, it changes the tokenizer's behaviour during encoding: it breaks words apart in order to match my new token.
I don't understand why it does this, or whether it is possible to force it to work as it did before.
Here is an example:
```python
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
```
If I encode the word 'crypt', it is not broken up, even though existing subwords ('cry', 'pt') are contained in it:
```python
tokenizer.encode('crypt')  ##
tokenizer.encode('cry')    ##
tokenizer.encode('pt')     ##
```
But if I add "ryp" as a token, it breaks the words that contain it (and I don't want them broken! I just want the full token "ryp" to be added):
```python
tokenizer.add_tokens('ryp')
model.resize_token_embeddings(len(tokenizer))

tokenizer.encode('crypt')  # [66, 50257, 83]
```
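Decoding the pieces shows that it is splitting around the added token: 66 is 'c', 50257 is my new 'ryp', and 83 is 't':

```python
tokenizer.tokenize('crypt')        # ['c', 'ryp', 't']
tokenizer.decode([66, 50257, 83])  # 'crypt'
```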
Would you know why this happens, and how I can force it to work as before?
I have read the docs about BPE encoding, but I can't find how to force it to use the longest matching token.
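While reading, I did notice that `add_tokens` can also take an `AddedToken` object, which has a `single_word` option. If I understand the docs correctly, this should stop the added token from matching inside longer words, although I'm not sure it is the intended fix. A sketch of what I mean, starting from a fresh tokenizer so my earlier plain `add_tokens('ryp')` call doesn't interfere:

```python
from transformers import AddedToken, AutoTokenizer

# Fresh tokenizer, without the previously added plain "ryp" token.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# single_word=True should (as far as I understand) make "ryp" match
# only as a standalone word, not as a piece of "crypt".
tokenizer.add_tokens(AddedToken("ryp", single_word=True))
model.resize_token_embeddings(len(tokenizer))

tokenizer.encode('crypt')  # hopefully the original, unbroken encoding
tokenizer.encode('ryp')    # [50257], the new token
```

Is this `single_word` option the right way to get the old behaviour back, or is there a proper way to make an added token behave like a normal BPE vocabulary entry?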