I have a question about training a custom RoBERTa model. My corpus consists of 100% English text, but its structure is very different from well-formed English book/Wikipedia sentences. Since the overall nomenclature of my dataset is very different from books/Wikipedia, I wanted to train a new LM from scratch with a new tokenizer trained on my dataset, to capture this corpus-specific nomenclature.
I would like to hear from experts: which of the following approaches is best for my case?
1. Train a custom tokenizer and train RoBERTa from scratch
2. Just fine-tune pretrained RoBERTa and rely on the existing BPE tokenizer
3. Use pretrained RoBERTa and somehow adjust the vocab (if that's even possible, and if so, how?)
My suggestion is to add some domain-specific tokens to the tokenizer's vocabulary and fine-tune the (HF) pre-trained RoBERTa on your task. This is one way to bootstrap to a new domain.
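A minimal sketch of what I mean, using the `transformers` library (the token list here is just a placeholder, you would substitute tokens mined from your corpus):

```python
# Sketch: extend the pretrained RoBERTa vocabulary and keep the model fine-tunable.
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

domain_tokens = ["hypothetical_term_a", "hypothetical_term_b"]  # placeholders
num_added = tokenizer.add_tokens(domain_tokens)

# Grow the embedding matrix so the new tokens get (randomly initialized) rows;
# the embeddings of the original vocabulary are left untouched.
model.resize_token_embeddings(len(tokenizer))

# ...then continue MLM pre-training on the in-domain corpus and/or fine-tune
# on the downstream task as usual.
```

The new rows start from random initialization, so the model only learns something useful about those tokens during the subsequent in-domain training or fine-tuning.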
Any tips on how to find those domain-specific tokens that I'm missing? Should I train a new tokenizer from scratch on my own dataset and then diff it against the original tokenizer to find the missing tokens?
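Something like this is what I have in mind, as a rough sketch (assumes the `tokenizers` and `transformers` libraries; `corpus.txt` is a placeholder path to my data, and the vocab size is arbitrary):

```python
# Sketch: train a byte-level BPE tokenizer on the in-domain corpus,
# then diff its vocabulary against the pretrained RoBERTa vocabulary.
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizer

domain_tokenizer = ByteLevelBPETokenizer()
domain_tokenizer.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2)

pretrained_vocab = set(RobertaTokenizer.from_pretrained("roberta-base").get_vocab())
domain_vocab = set(domain_tokenizer.get_vocab())

missing = domain_vocab - pretrained_vocab                     # candidate tokens to add
overlap = len(domain_vocab & pretrained_vocab) / len(domain_vocab)
print(f"vocab overlap: {overlap:.1%}, candidate new tokens: {len(missing)}")
```

The overlap ratio would also tell me how far my corpus is from the original vocabulary, which seems relevant for deciding between extending the vocab and training from scratch.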
Not completely sure, but if you change the tokenizer you'll have to retrain the model as well, because the model would never have seen these "new" tokens.
You should try to fine-tune the model first. I can only imagine a few scenarios where it makes sense to train a model from scratch: the vocab would have to be very different, e.g. when your domain is historical texts (or digitized texts with OCR errors…).
And you should have a look at the SciBERT paper (https://arxiv.org/abs/1903.10676): for some datasets the difference between "normal" BERT and SciBERT is very small…
What are your downstream tasks for evaluation, btw?
The domain of my docs is indeed very different.
I will train the tokenizer and see what the overlap is, to determine whether training from scratch is warranted.
Thanks for the link to the paper, too.
What are your downstream tasks for evaluation, btw?
Classification, NER and embeddings (similarity search)
Not completely sure, but if you change the tokenizer you'll have to retrain the model as well, because the model would never have seen these "new" tokens.
This is true, but my understanding of @chrisdoyleIE's answer is that extending the existing vocabulary still fits into the fine-tuning workflow, since the internal representations of the already existing tokens will not change. Am I right?
Yeah, the internal representations might not change (depending on what exactly is meant by this), but while fine-tuning the model would learn new facts (or relations) about (or between) the existing tokens and the newly added ones, as mentioned by @chrisdoyleIE. At least this is what I can think of.
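A quick way to check the "existing tokens don't change" part with `transformers` (just a sketch, using `roberta-base` and a made-up placeholder token):

```python
# Sketch: verify that resizing the embeddings after adding a token
# keeps the original embedding rows intact and only appends new rows.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

before = model.get_input_embeddings().weight.detach().clone()

tokenizer.add_tokens(["hypothetical_domain_token"])  # placeholder token
model.resize_token_embeddings(len(tokenizer))

after = model.get_input_embeddings().weight.detach()
print(torch.equal(before, after[: before.shape[0]]))  # True: old rows preserved
print(after.shape[0] - before.shape[0])               # 1: one new, randomly initialized row
```

Of course, once you fine-tune, all embeddings (old and new) get updated, which is the point about the model learning new relations between existing and added tokens.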