I have a question about training a custom RoBERTa model. My corpus consists of 100% English text, but its structure is completely different from well-formed English book / Wikipedia sentences. Since the overall nomenclature of my dataset is very different from books / Wikipedia, I wanted to train a new LM from scratch using a new tokenizer trained on my dataset, to capture this corpus-specific nomenclature.
I would like to hear from experts: which of the following approaches is best for my case?
- Train a custom tokenizer and train RoBERTa from scratch
- Just fine-tune the pretrained RoBERTa and rely on its existing BPE tokenizer
- Use the pretrained RoBERTa and somehow adjust the vocab (if that's even possible, and if so, how? I've sketched what I had in mind below)
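For the third option, this is roughly what I imagined, but I'm not sure it's the right way to do it. The `roberta-base` checkpoint and the token list are just placeholders for illustration; the idea is to add corpus-specific tokens to the existing tokenizer, resize the embedding matrix, and then fine-tune with the MLM objective on my data:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# start from the pretrained checkpoint and its BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# placeholder list: in practice these would be frequent terms mined from my corpus
new_domain_tokens = ["exampleterm1", "exampleterm2"]
num_added = tokenizer.add_tokens(new_domain_tokens)
print(f"added {num_added} tokens")

# grow the embedding matrix; the new rows are randomly initialized
# and would only get sensible values after fine-tuning on my corpus
model.resize_token_embeddings(len(tokenizer))
```

Is extending the vocab like this a reasonable middle ground, or does the random initialization of the new embeddings negate most of the benefit of starting from the pretrained weights?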