"Add_tokens" breaks words when encoding

I am using the add_tokens function to enlarge the vocabulary of the distilgpt2 pre-trained tokenizer. But when I do, it changes the tokenizer’s behaviour at encoding time: it breaks existing words apart in order to match my new token.
I don’t understand why this happens, and whether it is possible to force it to work as it did before.

I provide an example:

from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

If I encode the word ‘crypt’, the tokenizer does not break it up, even though it contains existing subwords:

tokenizer.encode('crypt')
## [29609]
tokenizer.encode('cry')
## [20470]
tokenizer.encode('pt')
## [457]

But if I add “ryp”, it breaks the word apart (and I don’t want it broken! I just want to add “ryp” as a full word):

tokenizer.add_tokens('ryp')
model.resize_token_embeddings(len(tokenizer))
tokenizer.encode('crypt')
# [66, 50257, 83]
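
Converting those ids back to tokens (a quick check, assuming the same tokenizer instance) makes the split explicit:

tokenizer.convert_ids_to_tokens([66, 50257, 83])
# ['c', 'ryp', 't']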

Would you know why this happens, and how I can force it to work as before?
I have read the docs about BPE encoding, but I can’t find how to force it to use the longest matching token.

Thanks!

@AlexandrosChariton replied:
from transformers import AutoTokenizer
from transformers.tokenization_utils import AddedToken

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# single_word=True: the added token is only matched as a standalone word,
# so it no longer splits words that merely contain it
tokenizer.add_tokens(AddedToken('ryp', single_word=True))

add_tokens also accepts AddedToken objects instead of plain strings, and with single_word=True the new token only matches whole words. Then, as you can see:

tokenizer.encode('crypt')
#[29609]
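
The added token still fires when it stands on its own (assuming it is the first token added to a freshly loaded tokenizer, so it gets id 50257, one past the original vocabulary):

tokenizer.encode('ryp')
#[50257]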

The Hugging Face source is here: Added Tokens


I would like to express my gratitude to @AlexandrosChariton for providing a solution to a problem I was facing. However, I have tried implementing the suggested method, and while it does work, it doesn’t entirely address my specific case.

Here is my situation:
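
For reference, a minimal setup along these lines (bert-base-cased, with AddedToken imported as in the reply above) should reproduce what follows:

from transformers import AutoTokenizer
from transformers.tokenization_utils import AddedToken

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")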

I have checked the tokens “He” and “Hello” in the vocabulary of the ‘bert-base-cased’ model, and they exist. The results are as follows:

tokenizer.tokenize("Hello")
# Output is ['Hello']

tokenizer.tokenize("He")
# Output is ['He']

tokenizer.tokenize("Helloe")
# Output is ['Hello', 'e']

tokenizer.tokenize("Hee")
# Output is ['He', 'e']

However, when I attempt to use the same method to add a new token, “pr,” using the following code:

tokenizer.add_tokens(AddedToken('pr', single_word=True))

The method works in some cases:

tokenizer.tokenize("present")
# Output is ['present']

tokenizer.tokenize("pr")
# Output is ['pr']

But it doesn’t work in other cases:

tokenizer.tokenize("prmention")
# Output is ['p', '##rm', '##ent', '##ion']
# My expected output is ['pr', '##ment', '##ion'], similar to the output of tokenizer.tokenize('privatemention') which is ['private', '##ment', '##ion']

Could anyone advise me on how to solve this problem? Thank you in advance for your assistance.