Further pre-training the tokenizer?

Hello!

Currently, I am pre-training RoBERTa with MLM from scratch by:

1. Training a tokenizer on my domain (all the texts that I have).
2. Masking 15% of the tokens (excluding special tokens).
3. Passing attention_mask, labels and input_ids to a RobertaForMaskedLM.

It has been training for a couple of hours and it seems OK.
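In case it helps, here is a stripped-down sketch of roughly what steps 2 and 3 look like on my side (the collator and model classes are the standard transformers ones; 'CustomBertTokenizer' is the directory my tokenizer code below saves to, and the sample sentence is just a placeholder):

import torch
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Tokenizer trained on my own corpus (see the tokenizer code further down)
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer',
                                                 model_max_length=512)

# From-scratch model whose embedding size matches my tokenizer
config = RobertaConfig(vocab_size=tokenizer.vocab_size,
                       max_position_embeddings=514)
model = RobertaForMaskedLM(config=config)

# Masks 15% of the (non-special) tokens and builds the labels for me
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True,
                                                mlm_probability=0.15)

encodings = tokenizer(['some text from my domain'], truncation=True)
batch = data_collator([{'input_ids': ids} for ids in encodings['input_ids']])

outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels'])
print(outputs.loss)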

My questions are: because I have only around 2GB of data (which is not much), my idea is to “further pre-train” the RoBERTa model (i.e. just use RobertaForMaskedLM.from_pretrained(…)), like a transfer-learning task. This seems smart, but catastrophic forgetting may occur, and what mostly concerns me is the tokenizer: I have trained an entirely new tokenizer on my own data, but RoBERTa originally comes with a different tokenizer. If I further pre-train with a different tokenizer, won’t that totally confuse the model, since the old weights were learned for the original tokenizer’s vocabulary? How should I approach further pre-training then?
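To make the question concrete, the naive version of what I have in mind would be something like this (roberta-base is just an example checkpoint, and I suspect the last line does not really fix anything, because my tokenizer’s ids would still point at embedding rows that were learned for completely different tokens):

from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# Pretrained weights from the original checkpoint...
model = RobertaForMaskedLM.from_pretrained('roberta-base')

# ...but a tokenizer trained on my own corpus, with its own vocabulary
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer')

# This makes the shapes match (my 50_000 vs roberta-base's 50_265), but the
# surviving embedding rows still correspond to the original tokenizer's ids
model.resize_token_embeddings(len(tokenizer))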

My tokenizer code:

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# All of my domain .txt files
paths = [str(x) for x in Path('data').glob('*.txt')]

tokenizer = ByteLevelBPETokenizer()

# Train a byte-level BPE vocabulary with RoBERTa-style special tokens
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Writes vocab.json and merges.txt into the CustomBertTokenizer directory
tokenizer.save_model('CustomBertTokenizer')
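And a quick sanity check that the saved files load back and tokenize my text (the sample sentence is again just a placeholder):

from tokenizers import ByteLevelBPETokenizer

# vocab.json and merges.txt are the two files save_model writes
loaded = ByteLevelBPETokenizer(
    'CustomBertTokenizer/vocab.json',
    'CustomBertTokenizer/merges.txt',
)
print(loaded.encode('some text from my domain').tokens)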

Thanks.
