Currently, I am pre-training RoBERTa with MLM from scratch by:

1. Training a tokenizer on my domain (all the texts that I have).
2. Masking 15% of the tokens (which are not special tokens).
3. Passing `attention_mask`, `labels` and `input_ids` to a `RobertaForMaskedLM`.

It has been training for a couple of hours and it seems OK.
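For context, step 2 can be sketched in plain Python. This is only a minimal sketch of what my masking does (the token ids for `<mask>` and the special tokens are hypothetical placeholders, not my real vocabulary):

```python
import random

MASK_ID = 4                    # hypothetical id of the <mask> token
SPECIAL_IDS = {0, 1, 2, 3, 4}  # hypothetical ids of <s>, <pad>, </s>, <unk>, <mask>
MASK_PROB = 0.15

def mask_tokens(input_ids, seed=None):
    """Return (masked_ids, labels): each non-special token is replaced by
    <mask> with probability 15%; labels keep the original id at masked
    positions and -100 elsewhere (the value the MLM loss ignores)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if tok not in SPECIAL_IDS and rng.random() < MASK_PROB:
            masked.append(MASK_ID)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels
```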
My questions are the following. Because I have only around 2 GB of data (which is not much), my idea is to "further pre-train" the published RoBERTa model (i.e. start from `RobertaForMaskedLM.from_pretrained(…)`), as a transfer-learning task. This seems sensible, but catastrophic forgetting may occur, and what mostly concerns me is the tokenizer: I have trained an entirely new tokenizer on my own data, while RoBERTa was originally pre-trained with a different one. If I further pre-train with a different tokenizer, won't that completely confuse the model, since the old weights were learned for the original tokenizer? How should I approach further pre-training in that case?
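To make my tokenizer concern concrete: embeddings are looked up by token id, so a new vocabulary silently reassigns pretrained rows to different words. A toy stdlib-only illustration (the vocabularies and vectors here are made up, not real RoBERTa ids):

```python
# Two hypothetical vocabularies that assign different ids to the same words.
old_vocab = {'<s>': 0, 'the': 1, 'cat': 2}   # vocabulary the model was trained with
new_vocab = {'<s>': 0, 'cat': 1, 'the': 2}   # a freshly trained vocabulary

# Pretrained embedding rows, keyed by the OLD token ids.
embeddings = {0: [0.0, 0.1], 1: [0.9, 0.2], 2: [0.3, 0.7]}

def embed(word, vocab):
    """Look up a word's embedding row via its id in the given vocabulary."""
    return embeddings[vocab[word]]

# Under old_vocab, 'cat' retrieves the row it was trained with;
# under new_vocab, 'cat' retrieves the row that was trained for 'the'.
```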
My tokenizer code:
```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path('data').glob('*.txt')]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=50_000,
    min_frequency=2,
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'],
)
tokenizer.save_model('CustomBertTokenizer')
```