Unfortunately, it doesn’t seem to. I’ve been resizing the embedding layers using the code you provided, but that doesn’t affect the behavior of the tokenizer itself, so the inputs the model receives still show the same problem.
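(To be concrete about what I mean, here’s roughly the sort of sanity check I have in mind, using bert-base-uncased; resizing on its own leaves the tokenizer’s output untouched:)
from transformers import BertForMaskedLM, BertTokenizer
tok = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
before = tok.tokenize('testing.')            # ['testing', '##.']
model.resize_token_embeddings(len(tok) + 1)  # only grows the model's embedding matrix
after = tok.tokenize('testing.')             # still ['testing', '##.']
assert before == after                       # the tokenizer never sees the resize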
The example in the docs you provided doesn’t seem to run into the same issue, but only because BertTokenizerFast doesn’t appear to tokenize periods as subword tokens to begin with. If I run that example but swap in BertTokenizer (with do_basic_tokenize=False) or RobertaTokenizer, the same issue is still present.
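(For comparison, a quick check of the fast tokenizer; since its pre-tokenizer already splits the period off, there’s no '##.' for an added token to break:)
from transformers import BertTokenizerFast
fast_t = BertTokenizerFast.from_pretrained('bert-base-uncased')
fast_t.tokenize('testing.')
# ['testing', '.']  <- the period is never a '##' subword here, so the docs example can't show the problem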
(I left the models out of the code from before since the behavior doesn’t seem to depend on them, but in case it’s useful, here’s an updated version of the example above that shows the same behavior.)
from transformers import BertForMaskedLM, BertTokenizer, RobertaForMaskedLM, RobertaTokenizer
new_word = 'mynewword'
bert_t = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bert_m = BertForMaskedLM.from_pretrained('bert-base-uncased')
bert_t.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert_t.tokenize('testing.')
# ['testing', '##.']
bert_t.add_tokens(new_word)
bert_m.resize_token_embeddings(len(bert_t))
bert_t.tokenize('mynewword') # now it does
# ['mynewword']
bert_t.tokenize('mynewword.')
# ['mynewword', '.']  <- the '.' is no longer tokenized as '##.'
roberta_t = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize=False)
roberta_m = RobertaForMaskedLM.from_pretrained('roberta-base')
roberta_t.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta_t.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']
roberta_t.add_tokens(new_word)
roberta_m.resize_token_embeddings(len(roberta_t))
roberta_t.tokenize('mynewword') # now it does
# ['mynewword']
roberta_t.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']  <- 'a' has lost its 'Ġ' prefix
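(And just to confirm the first point above: this isn’t only a tokenize() display quirk; as far as I can tell it carries through to the ids the model actually receives, e.g.:)
ids = bert_t.encode('mynewword.', add_special_tokens=False)
bert_t.convert_ids_to_tokens(ids)
# ['mynewword', '.']  <- same split as above, so the model inputs are affected as well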