Adding new tokens while preserving tokenization of adjacent tokens

I’m trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences with the new word, and then see what it predicts about the word in other, different contexts, to examine the state of the model’s knowledge of certain properties of language.

In order to do this, I’d like to add the new tokens and essentially treat them like new ordinary words (that the model just hasn’t happened to encounter yet). They should behave exactly like normal words once added, with the exception that their embeddings will be randomly initialized and then learned during fine-tuning.

However, I’m running into some issues doing this. In particular, the tokens surrounding the newly added tokens do not behave as expected when the tokenizer is initialized with do_basic_tokenize=False. The problem can be seen in the following example: in the case of BERT, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as . instead of as the expected ##.), and in the case of RoBERTa, the word following the newly added token is treated as though it does not have a preceding space (i.e., it is tokenized as a instead of as Ġa).

from transformers import BertTokenizer, RobertaTokenizer

new_word = 'mynewword'
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']

bert.add_tokens(new_word)
bert.tokenize('mynewword') # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']

roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']

roberta.add_tokens(new_word)
roberta.tokenize('mynewword') # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']

Is there a way for me to add the new tokens while getting the surrounding tokens to behave as they would if no added token were there? I feel like this is important because the model could end up learning that, for instance, the new token can occur before . while most other tokens can only occur before ##., which seems like it would affect how it generalizes. I could turn on basic tokenization to avoid the BERT problem, but that wouldn’t really reflect the full state of the model’s knowledge, since it collapses the distinction between different tokens. And it wouldn’t help with the RoBERTa problem, which is there regardless.
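To make the concern concrete: . and ##. are separate vocabulary entries with separate ids (and therefore separate embeddings), so the model really does see different inputs in the two cases. A quick self-contained check:

from transformers import BertTokenizer

bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)

# '.' and '##.' map to different ids, and hence to different embedding rows,
# so sentences with the added token feed the model a different input than
# otherwise identical sentences without it
print(bert.convert_tokens_to_ids(['.', '##.']))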

In addition, I’d ideally be able to add the RoBERTa token as Ġmynewword, but I’m assuming that as long as it never occurs as the first word in a sentence, that shouldn’t matter.
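One thing I’ve been meaning to experiment with is adding the token via the AddedToken wrapper rather than as a bare string, since it exposes lstrip, rstrip, and single_word flags that control how whitespace around the added token is handled. I’m not sure any combination actually restores the Ġ on the following word, so this is just a sketch to test, not a fix:

from transformers import AddedToken, RobertaTokenizer

roberta = RobertaTokenizer.from_pretrained('roberta-base')

# lstrip/rstrip control whether whitespace adjacent to the added token is
# absorbed into the match; single_word prevents matching inside longer words.
# Whether any combination preserves the Ġ on the following word is exactly
# what would need testing.
roberta.add_tokens(AddedToken('mynewword', lstrip=False, rstrip=False, single_word=True))
print(roberta.tokenize('A mynewword a'))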


Hey @mawilson if you want to add new tokens to the vocabulary, then in general you’ll need to resize the embedding layers with

model.resize_token_embeddings(len(tokenizer))
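To make it concrete, the resize just grows the model’s embedding matrix by one row per added token, and the new row starts out untrained. A quick sketch (the shapes in the comments assume bert-base-uncased):

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

print(model.get_input_embeddings().weight.shape)  # torch.Size([30522, 768])

tokenizer.add_tokens('mynewword')
model.resize_token_embeddings(len(tokenizer))  # adds a freshly initialised row for the new token

print(model.get_input_embeddings().weight.shape)  # torch.Size([30523, 768])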

You can see a full example in the docs - does that help solve your problem?

Unfortunately, it doesn’t seem to. I’ve been resizing the embedding layers using the code you provided, but that doesn’t change the behavior of the tokenizer itself, so the inputs to the model are still affected.

The example in the docs you provided doesn’t seem to run into the same issue, but only because BertTokenizerFast doesn’t appear to tokenize periods as subword tokens to begin with. If I run that example but swap in BertTokenizer (with do_basic_tokenize=False) or RobertaTokenizer, the same issue is still present.
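For reference, this is what I see with the fast tokenizer under its default settings; the period comes out as a standalone token rather than a subword, so the issue never surfaces there:

from transformers import BertTokenizerFast

bert_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

# the default (basic) pre-tokenization splits the period off as its own token,
# so there is no '##.' to preserve in the first place
print(bert_fast.tokenize('testing.'))
# ['testing', '.']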

(I left the models out of the code from before since the behavior doesn’t seem to rely on them, but in case it’s useful here’s an updated version of the example above that shows the same behavior.)

from transformers import BertForMaskedLM, BertTokenizer, RobertaForMaskedLM, RobertaTokenizer

new_word = 'mynewword'
bert_t = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert_m = BertForMaskedLM.from_pretrained('bert-base-uncased')
bert_t.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert_t.tokenize('testing.')
# ['testing', '##.']

bert_t.add_tokens(new_word)
bert_m.resize_token_embeddings(len(bert_t))
bert_t.tokenize('mynewword') # now it does
# ['mynewword']
bert_t.tokenize('mynewword.')
# ['mynewword', '.']

roberta_t = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta_m = RobertaForMaskedLM.from_pretrained('roberta-base')
roberta_t.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta_t.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']

roberta_t.add_tokens(new_word)
roberta_m.resize_token_embeddings(len(roberta_t))
roberta_t.tokenize('mynewword') # now it does
# ['mynewword']
roberta_t.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']

Hi @mawilson

I’m facing the same issue at the moment. Have you found a solution yet?

Same issue here.

It is also not clear whether one should add “mynewword” to the tokenizer or “Ġmynewword”. Any feedback on that?