I am using the `add_tokens` function to enlarge the vocabulary of the pre-trained distilgpt2 tokenizer. But when I do, it changes the tokenizer's behaviour during encoding: it breaks words apart in order to match my new token.
I don't understand why it does this, or whether it is possible to force it to work as it did before.
Here is an example:
```python
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
```
If I encode the word 'crypt', it is not broken up, even though existing subwords ('cry', 'pt') are contained in it:
```python
tokenizer.encode('crypt')  ##
tokenizer.encode('cry')    ##
tokenizer.encode('pt')     ##
```
But if I add "ryp" as a token, it breaks the words that contain it (and I don't want them broken! I just want the full token "ryp" to be added):
```python
tokenizer.add_tokens('ryp')
model.resize_token_embeddings(len(tokenizer))

tokenizer.encode('crypt')  # [66, 50257, 83]
```
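Decoding the pieces shows that it is splitting around the added token: 66 is 'c', 50257 is my new 'ryp', and 83 is 't':

```python
tokenizer.tokenize('crypt')        # ['c', 'ryp', 't']
tokenizer.decode([66, 50257, 83])  # 'crypt'
```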
Would you know why this happens, and how I can force it to work as before?
I have read the docs about BPE encoding, but I can't find how to force it to use the longest matching token.
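While reading, I did notice that `add_tokens` can also take an `AddedToken` object, which has a `single_word` option. If I understand the docs correctly, this should stop the added token from matching inside longer words, although I'm not sure it is the intended fix. A sketch of what I mean, starting from a fresh tokenizer so my earlier plain `add_tokens('ryp')` call doesn't interfere:

```python
from transformers import AddedToken, AutoTokenizer

# Fresh tokenizer, without the previously added plain "ryp" token.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# single_word=True should (as far as I understand) make "ryp" match
# only as a standalone word, not as a piece of "crypt".
tokenizer.add_tokens(AddedToken("ryp", single_word=True))
model.resize_token_embeddings(len(tokenizer))

tokenizer.encode('crypt')  # hopefully the original, unbroken encoding
tokenizer.encode('ryp')    # [50257], the new token
```

Is this `single_word` option the right way to get the old behaviour back, or is there a proper way to make an added token behave like a normal BPE vocabulary entry?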