RoBERTa from scratch with different vocab vs. fine-tuning

I have a question about training a custom RoBERTa model. My corpus consists of 100% English text, but the structure of the text is totally different from well-formed English book/Wikipedia sentences. As the overall nomenclature of my dataset is very different from books/Wikipedia, I wanted to train a new LM from scratch using a new tokenizer trained on my dataset, to capture this corpus-specific nomenclature.

I would like to hear from experts - which of the following approaches is the best one for my case?

  1. Train a custom tokenizer and train RoBERTa from scratch
  2. Just fine-tune the pretrained RoBERTa and rely on its existing BPE tokenizer
  3. Use the pretrained RoBERTa and somehow adjust the vocab (if that's even possible, and if so, how?)
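For reference, the first step of option 1 can be sketched with the Hugging Face `tokenizers` package: training a byte-level BPE tokenizer from scratch on the domain corpus. The corpus lines below are made-up placeholders for your data, and the vocab size is deliberately tiny for the sketch.

```python
# Sketch: train a byte-level BPE tokenizer from scratch on an in-memory
# domain corpus (a stand-in for your real files). Assumes the Hugging Face
# `tokenizers` package is installed; the corpus lines are invented.
from tokenizers import ByteLevelBPETokenizer

domain_corpus = [
    "ERR_CONN_TIMEOUT node-7 retry=3 backoff=250ms",
    "ERR_CONN_TIMEOUT node-2 retry=1 backoff=500ms",
    "WARN disk_usage node-7 pct=91.4 threshold=90",
] * 50  # repeated so BPE merges reach the minimum frequency

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    domain_corpus,
    vocab_size=500,  # tiny for the sketch; 30k-50k is typical for real corpora
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoding = tokenizer.encode("ERR_CONN_TIMEOUT node-7 retry=3")
print(encoding.tokens)
```

After training, the vocab and merges can be saved with `tokenizer.save_model(...)` and reused when pretraining RoBERTa from scratch.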

Bump, anyone?

You could start from the pre-trained model and see if you get satisfactory results. If not, then you can try training from scratch.

My suggestion is to add some domain-specific tokens to the tokenizer's vocabulary and fine-tune the (HF) pre-trained RoBERTa on your task. This is one way to bootstrap to a new domain.


Any tips on how to find those domain-specific tokens that I'm missing? Should I train a new tokenizer from scratch on my own dataset and then diff it against the original tokenizer's vocabulary to find the missing tokens?
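That diff approach can be sketched directly: train a small BPE tokenizer on your corpus with the `tokenizers` package, then subtract the pretrained vocabulary. The base vocab below is a tiny stand-in set; in practice you would use `set(pretrained_tokenizer.get_vocab())`.

```python
# Sketch: find candidate domain tokens by diffing a freshly trained
# tokenizer's vocab against the pretrained one. Corpus and base vocab
# are stand-ins; assumes the `tokenizers` package is installed.
from tokenizers import ByteLevelBPETokenizer

domain_corpus = ["kubelet evicted pod due to node memory pressure"] * 100

domain_tok = ByteLevelBPETokenizer()
domain_tok.train_from_iterator(domain_corpus, vocab_size=400, min_frequency=2)

base_vocab = {"Ġthe", "Ġand", "node", "Ġmemory"}  # stand-in for the real vocab
domain_vocab = set(domain_tok.get_vocab())

# tokens the domain tokenizer learned that the base vocab lacks
candidates = sorted(domain_vocab - base_vocab)
print(candidates[:10])
```

The resulting candidates could then be filtered (e.g. by frequency in your corpus) before being passed to `tokenizer.add_tokens(...)`.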

Bump again, anyone?

Not completely sure, but if you change the tokeniser you'll have to retrain the model as well, because the model will never have seen these "new" tokens.

Please correct me if I’m wrong.

Hi,

you should try to fine-tune the model first. I can only imagine a few scenarios where it makes sense to train a model from scratch: the vocab would have to be very different, e.g. when your domain is historical texts (or digitized texts with OCR errors…).

And you should have a look at the SciBERT paper (https://arxiv.org/abs/1903.10676) - for some datasets the difference between "normal" BERT and SciBERT is very small…

What are your downstream tasks for evaluation btw. :thinking:


@stefan-it thanks for your reply!

The domain of my docs is indeed very different.
I will train the tokenizer and see what is the overlap to determine the applicability of training from scratch.
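The overlap check can be sketched in a few lines: the fraction of the pretrained vocabulary that also appears in a tokenizer trained on the new corpus. Both vocabs below are tiny stand-in sets; in practice you would use `get_vocab()` from each tokenizer.

```python
# Sketch: measure vocab overlap between the pretrained tokenizer and one
# trained on the domain corpus. Both vocab sets are invented stand-ins.
pretrained_vocab = {"the", "of", "and", "node", "error", "memory"}
domain_vocab = {"node", "error", "memory", "kubelet", "evicted", "oom"}

# share of the pretrained vocab that the domain corpus still uses
overlap = len(pretrained_vocab & domain_vocab) / len(pretrained_vocab)
print(f"{overlap:.0%} of the pretrained vocab is reused")  # -> 50%
```

A low overlap would support training from scratch; a high one would favour fine-tuning with a handful of added tokens.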

Thanks for the link to the paper too.

What are your downstream tasks for evaluation btw. :thinking:

Classification, NER and embeddings (similarity search)

@rsk97

Not completely sure, but if you change the tokeniser you'll have to retrain the model as well, because the model will never have seen these "new" tokens.

This is true, but my understanding of @chrisdoyleIE 's answer is that extending the existing vocabulary still fits into the fine-tuning flow, since the internal representations of the already existing tokens will not change - am I right?
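A conceptual sketch of why that holds: when the embedding matrix is resized for an extended vocab, rows for existing tokens are copied over unchanged and only the new tokens get freshly initialised rows. The sizes below are made up for illustration.

```python
# Conceptual sketch of resizing an embedding matrix for new tokens:
# existing rows are preserved, new rows are appended (made-up sizes).
import numpy as np

rng = np.random.default_rng(0)
old_embeddings = rng.normal(size=(10, 4))  # 10 existing tokens, dim 4

num_new_tokens = 3
new_rows = rng.normal(size=(num_new_tokens, 4))
new_embeddings = np.vstack([old_embeddings, new_rows])

# existing tokens keep exactly the representation they had before
assert np.array_equal(new_embeddings[:10], old_embeddings)
print(new_embeddings.shape)  # (13, 4)
```

The new rows start out random, which is why fine-tuning is still needed: the model has to learn useful representations for the added tokens.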


Yeah, the internal representations might not change (depending on what exactly is meant by that), but while fine-tuning, the model would learn new facts (or relations) about (or between) the existing tokens and the newly added ones, as mentioned by @chrisdoyleIE. At least that's what I can think of.

Hope this makes sense.