I’m trying to add some new tokens to the BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences containing the new word, and then see what they predict about the word in other, different contexts, to examine the state of the model’s knowledge of certain properties of language.
In order to do this, I’d like to add the new tokens and essentially treat them like ordinary new words that the model just hasn’t happened to encounter yet. They should behave exactly like normal words once added, except that their embeddings will be randomly initialized and then learned during fine-tuning.
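For reference, the setup I have in mind is roughly this (a minimal sketch; `bert-base-uncased` and `BertForMaskedLM` just stand in for whichever checkpoint and head I end up fine-tuning):

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# register the new token and grow the embedding matrix to match;
# the new row is randomly initialized and gets learned during fine-tuning
tokenizer.add_tokens('mynewword')
model.resize_token_embeddings(len(tokenizer))
```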
However, I’m running into some issues doing this. In particular, the tokens surrounding a newly added token do not behave as expected when the tokenizer is initialized with `do_basic_tokenize=False`. The problem can be seen in the example below. In the case of BERT, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as `.` instead of as the expected `##.`), and in the case of RoBERTa, the word following the newly added token is treated as though it does not have a preceding space (i.e., it is tokenized as `a` instead of as the expected `Ġa`).
```python
from transformers import BertTokenizer, RobertaTokenizer

new_word = 'mynewword'

bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bert.tokenize('mynewword')        # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']

bert.add_tokens(new_word)
bert.tokenize('mynewword')        # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']

roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize=False)
roberta.tokenize('mynewword')     # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']

roberta.add_tokens(new_word)
roberta.tokenize('mynewword')     # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']
```
Is there a way for me to add the new tokens while getting the surrounding tokens to behave as they would if the added token were not there? This feels important because the model could end up learning that, for instance, the new token can occur before `.`, while most other tokens can only occur before `##.`, which seems like it would affect how it generalizes. I could turn on basic tokenization to solve the BERT problem here, but that wouldn’t really reflect the full state of the model’s knowledge, since it collapses the distinction between different tokens. And it doesn’t help with the RoBERTa problem, which is there regardless.
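For concreteness, this is what I mean about basic tokenization collapsing the distinction (a small check; the outputs are what I expect rather than something I’ve verified exhaustively):

```python
from transformers import BertTokenizer

# do_basic_tokenize defaults to True here
bert_basic = BertTokenizer.from_pretrained('bert-base-uncased')
bert_basic.add_tokens('mynewword')

# the basic tokenizer splits punctuation off first, so the '.' vs '##.'
# distinction disappears for every word, not just the added one
bert_basic.tokenize('testing.')    # expected: ['testing', '.']
bert_basic.tokenize('mynewword.')  # expected: ['mynewword', '.']
```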
In addition, I’d ideally like to be able to add the RoBERTa token as `Ġmynewword`, but I’m assuming that as long as the new word never occurs as the first word in a sentence, that shouldn’t matter.
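In case it’s relevant, this is what I’d try for that, though I haven’t confirmed whether an added token containing a literal `Ġ` character actually matches text with a leading space, which is part of my uncertainty here:

```python
from transformers import RobertaTokenizer

roberta = RobertaTokenizer.from_pretrained('roberta-base')
roberta.add_tokens('Ġmynewword')

# unclear to me whether this matches ' mynewword' in raw text, since added
# tokens seem to be matched against the input before the byte-level
# pre-tokenization that turns spaces into 'Ġ'
roberta.tokenize('A mynewword a')
```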