Adding new tokens while preserving tokenization of adjacent tokens

I’m trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences with the new word, and then see what it predicts about the word in other, different contexts, to examine the state of the model’s knowledge of certain properties of language.

In order to do this, I’d like to add the new tokens and essentially treat them like new ordinary words (that the model just hasn’t happened to encounter yet). They should behave exactly like normal words once added, with the exception that their embeddings will be randomly initialized and then learned during fine-tuning.

However, I’m running into some issues doing this. In particular, the tokens surrounding the newly added tokens do not behave as expected when the tokenizer is initialized with do_basic_tokenize=False. The problem can be seen in the following example: in the case of BERT, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as . instead of as the expected ##.), and in the case of RoBERTa, the word following the newly added token is treated as though it does not have a preceding space (i.e., it is tokenized as a instead of as Ġa).

from transformers import BertTokenizer, RobertaTokenizer

new_word = 'mynewword'
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']

bert.add_tokens(new_word)
bert.tokenize('mynewword') # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']

roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']

roberta.add_tokens(new_word)
roberta.tokenize('mynewword') # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']

Is there a way for me to add the new tokens while getting the surrounding tokens to behave as they would if no added token were there? I feel like this is important because the model could end up learning that, for instance, the new token can occur before . while most other tokens can only occur before ##., which seems like it would affect how it generalizes. I could turn on basic tokenization to avoid the BERT problem, but that wouldn’t really reflect the full state of the model’s knowledge, since it collapses the distinction between different tokens. And it wouldn’t help with the RoBERTa problem, which is there regardless.
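To make the concern concrete: . and ##. are separate vocabulary entries with separate ids (and therefore separate embeddings), so the model really does see different inputs in the two cases. A quick self-contained check:

from transformers import BertTokenizer

bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)

# '.' and '##.' map to different ids, and hence to different embedding rows,
# so sentences with the added token feed the model a different input than
# otherwise identical sentences without it
print(bert.convert_tokens_to_ids(['.', '##.']))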

In addition, I’d ideally be able to add the RoBERTa token as Ġmynewword, but I’m assuming that as long as it never occurs as the first word in a sentence, that shouldn’t matter.
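One thing I’ve been meaning to experiment with is adding the token via the AddedToken wrapper rather than as a bare string, since it exposes lstrip, rstrip, and single_word flags that control how whitespace around the added token is handled. I’m not sure any combination actually restores the Ġ on the following word, so this is just a sketch to test, not a fix:

from transformers import AddedToken, RobertaTokenizer

roberta = RobertaTokenizer.from_pretrained('roberta-base')

# lstrip/rstrip control whether whitespace adjacent to the added token is
# absorbed into the match; single_word prevents matching inside longer words.
# Whether any combination preserves the Ġ on the following word is exactly
# what would need testing.
roberta.add_tokens(AddedToken('mynewword', lstrip=False, rstrip=False, single_word=True))
print(roberta.tokenize('A mynewword a'))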


Hey @mawilson if you want to add new tokens to the vocabulary, then in general you’ll need to resize the embedding layers with

model.resize_token_embeddings(len(tokenizer))
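To make it concrete, the resize just grows the model’s embedding matrix by one row per added token, and the new row starts out untrained. A quick sketch (the shapes in the comments assume bert-base-uncased):

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

print(model.get_input_embeddings().weight.shape)  # torch.Size([30522, 768])

tokenizer.add_tokens('mynewword')
model.resize_token_embeddings(len(tokenizer))  # adds a freshly initialised row for the new token

print(model.get_input_embeddings().weight.shape)  # torch.Size([30523, 768])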

You can see a full example in the docs - does that help solve your problem?

Unfortunately, it doesn’t seem to. I’ve been resizing the embedding layers using the code you provided, but that doesn’t change the behavior of the tokenizer itself, so the inputs to the model are still affected.

The example in the docs you provided doesn’t seem to run into the same issue, but only because BertTokenizerFast doesn’t appear to tokenize periods as subword tokens to begin with. If I run that example but swap in BertTokenizer (with do_basic_tokenize=False) or RobertaTokenizer, the same issue is still present.
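For reference, this is what I see with the fast tokenizer under its default settings; the period comes out as a standalone token rather than a subword, so the issue never surfaces there:

from transformers import BertTokenizerFast

bert_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

# the default (basic) pre-tokenization splits the period off as its own token,
# so there is no '##.' to preserve in the first place
print(bert_fast.tokenize('testing.'))
# ['testing', '.']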

(I left the models out of the code from before since the behavior doesn’t seem to rely on them, but in case it’s useful here’s an updated version of the example above that shows the same behavior.)

from transformers import BertForMaskedLM, BertTokenizer, RobertaForMaskedLM, RobertaTokenizer

new_word = 'mynewword'
bert_t = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert_m = BertForMaskedLM.from_pretrained('bert-base-uncased')
bert_t.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert_t.tokenize('testing.')
# ['testing', '##.']

bert_t.add_tokens(new_word)
bert_m.resize_token_embeddings(len(bert_t))
bert_t.tokenize('mynewword') # now it does
# ['mynewword']
bert_t.tokenize('mynewword.')
# ['mynewword', '.']

roberta_t = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta_m = RobertaForMaskedLM.from_pretrained('roberta-base')
roberta_t.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta_t.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']

roberta_t.add_tokens(new_word)
roberta_m.resize_token_embeddings(len(roberta_t))
roberta_t.tokenize('mynewword') # now it does
# ['mynewword']
roberta_t.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']

Hi @mawilson

I’m facing the same issue at the moment. Have you found a solution yet?

Same issue here.

It is also not clear whether one should add “mynewword” to the tokenizer or “Ġmynewword”. Any feedback on that?