Continue Pre-Training RoBERTa

I want to further train the RoBERTa model on my own dataset.

General idea:

I want to resume training from the point where RoBERTa's pre-training stopped and continue it on my dataset.

I have a fairly small dataset: 60K sentences, mostly in English (domain-specific). I want to leverage transfer learning in Transformers.

Is it even possible? If so how?

Yes, it certainly is.
RoBERTa is not the easiest model to fine-tune, since it can be used for many different tasks and the setup depends on which one you pick. I am sure you will find a fine-tuned RoBERTa checkpoint for your specific task here. You can copy most of its hyperparameters and tweak the rest, such as batch size, learning rate, and maybe even some of the Adam values.
All of this can be done using HF's Trainer class, although the training itself depends on the task you want to use it for.
For example, text classification can be done easily by following this notebook.

For additional resources, you might want to take a look at plain BERT. There are many more example scripts for training it, and the architecture is essentially the same except for the tokenizer.
See the BERT model page; it links to many helpful guides.

Good luck fine-tuning!

The original question is about continuing to pre-train the model, so I’m not sure fine-tuning is the right answer here. It also depends on what sort of dataset they have: a general text corpus for pre-training or a task-specific dataset for fine-tuning. Are there examples of how to continue pre-training the model on a smaller domain-specific corpus?

Hey, there is a script I came across called run_mlm.py. It trains with the masked language modeling objective and helped me in fine-tuning, so do take a look.

(transformers/examples/pytorch/language-modeling at main · huggingface/transformers · GitHub)
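
For the 60K-sentence case above, here is a minimal sketch of how run_mlm.py might be launched from Python to continue masked language model training. It assumes the script has been downloaded from the linked folder into the working directory and that its requirements are installed. The roberta-base checkpoint, the file names, and every hyperparameter value are illustrative assumptions, not tuned recommendations, and the flags may differ between versions of the script.

```python
import shlex
import subprocess
import sys


def build_run_mlm_command(
    train_file,
    validation_file,
    output_dir,
    script_path="run_mlm.py",    # assumed: script copied next to this file
    model_name="roberta-base",   # assumed starting checkpoint
    epochs=3,                    # illustrative value
    batch_size=16,               # illustrative value
    learning_rate=5e-5,          # illustrative value
    max_seq_length=128,          # illustrative value
):
    """Assemble the argument list that starts run_mlm.py from a pretrained checkpoint."""
    return [
        sys.executable, str(script_path),
        "--model_name_or_path", model_name,
        "--train_file", str(train_file),
        "--validation_file", str(validation_file),
        "--line_by_line",  # treat each line (sentence) as its own sequence
        "--max_seq_length", str(max_seq_length),
        "--do_train",
        "--do_eval",
        "--per_device_train_batch_size", str(batch_size),
        "--num_train_epochs", str(epochs),
        "--learning_rate", str(learning_rate),
        "--output_dir", str(output_dir),
    ]


def main(dry_run=True):
    # Hypothetical file names: plain-text files with one sentence per line.
    command = build_run_mlm_command(
        train_file="domain_train.txt",
        validation_file="domain_valid.txt",
        output_dir="roberta-domain-adapted",
    )
    print(shlex.join(command))
    if not dry_run:
        # Actually launches training; this needs the script, data files, and dependencies.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    main()
```

With dry_run=True this only prints the command, so the arguments can be checked before anything is trained. The --line_by_line flag fits a corpus of individual sentences, and the model saved in the output directory can afterwards be loaded for task-specific fine-tuning.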

Hope this helps,
Parth