Further pre-training the tokenizer?

Hello!

Currently, I am pre-training RoBERTa with MLM from scratch by:

1. Training a tokenizer on my domain (all the texts that I have).
2. Masking 15% of the tokens (excluding special tokens).
3. Passing attention_mask, labels and input_ids to a RobertaForMaskedLM.

It has been training for a couple of hours and it seems OK.
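In case it helps, here is a stripped-down sketch of roughly what steps 2 and 3 look like on my side (the collator and model classes are the standard transformers ones; 'CustomBertTokenizer' is the directory my tokenizer code below saves to, and the sample sentence is just a placeholder):

import torch
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Tokenizer trained on my own corpus (see the tokenizer code further down)
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer',
                                                 model_max_length=512)

# From-scratch model whose embedding size matches my tokenizer
config = RobertaConfig(vocab_size=tokenizer.vocab_size,
                       max_position_embeddings=514)
model = RobertaForMaskedLM(config=config)

# Masks 15% of the (non-special) tokens and builds the labels for me
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True,
                                                mlm_probability=0.15)

encodings = tokenizer(['some text from my domain'], truncation=True)
batch = data_collator([{'input_ids': ids} for ids in encodings['input_ids']])

outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels'])
print(outputs.loss)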

My questions are: because I have only around 2GB of data (which is not much), my idea is to “further pre-train” the RoBERTa model (i.e. just use RobertaForMaskedLM.from_pretrained(…)), like a transfer-learning task. This seems smart, but catastrophic forgetting may occur, and what mostly concerns me is the tokenizer: I have trained an entirely new tokenizer on my own data, but RoBERTa originally comes with a different tokenizer. If I further pre-train with a different tokenizer, won’t that totally confuse the model, since the old weights were learned for the original tokenizer’s vocabulary? How should I approach further pre-training then?
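To make the question concrete, the naive version of what I have in mind would be something like this (roberta-base is just an example checkpoint, and I suspect the last line does not really fix anything, because my tokenizer’s ids would still point at embedding rows that were learned for completely different tokens):

from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# Pretrained weights from the original checkpoint...
model = RobertaForMaskedLM.from_pretrained('roberta-base')

# ...but a tokenizer trained on my own corpus, with its own vocabulary
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer')

# This makes the shapes match (my 50_000 vs roberta-base's 50_265), but the
# surviving embedding rows still correspond to the original tokenizer's ids
model.resize_token_embeddings(len(tokenizer))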

My tokenizer code:

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# All of my domain .txt files
paths = [str(x) for x in Path('data').glob('*.txt')]

tokenizer = ByteLevelBPETokenizer()

# Train a byte-level BPE vocabulary with RoBERTa-style special tokens
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Writes vocab.json and merges.txt into the CustomBertTokenizer directory
tokenizer.save_model('CustomBertTokenizer')
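And a quick sanity check that the saved files load back and tokenize my text (the sample sentence is again just a placeholder):

from tokenizers import ByteLevelBPETokenizer

# vocab.json and merges.txt are the two files save_model writes
loaded = ByteLevelBPETokenizer(
    'CustomBertTokenizer/vocab.json',
    'CustomBertTokenizer/merges.txt',
)
print(loaded.encode('some text from my domain').tokens)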

Thanks.
