Further pre-train RoBERTa model

I have gone through this code for training from scratch and understood how to pre-train a model from scratch. I have the following doubts about this code:

  • What does block_size in LineByLineTextDataset represent?

  • If I want to further pre-train the roberta-base model (instead of training from scratch) using my own corpus, what changes do I have to make in the above code besides the following ones:

from transformers import RobertaForMaskedLM, RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

I am aware that I do not need to train the tokenizer from scratch.
@thomwolf @julien-c


Hi @mr-nlp, I think you can use the same run_language_modeling.py script to further pre-train RoBERTa; just provide your own datasets.
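If you'd rather do it in your own script or notebook instead of run_language_modeling.py, here's a minimal sketch of further pre-training roberta-base with the Trainer API. The corpus path "my_corpus.txt", the output directory, and the hyperparameters are placeholders, not recommendations:

```python
# Minimal sketch: further pre-train roberta-base on your own corpus with MLM.
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse the pretrained tokenizer and weights instead of training from scratch.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Each non-empty line of the corpus file becomes one training example,
# truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",  # placeholder: path to your own corpus
    block_size=128,
)

# Dynamic masking for the masked-language-modeling objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-further-pretrained",  # placeholder
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("./roberta-further-pretrained")
```

The only real difference from the train-from-scratch setup is that both the tokenizer and the model are loaded with from_pretrained("roberta-base") rather than built from a fresh config.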

block_size is used as the tokenizer's max_length: LineByLineTextDataset tokenizes each line of the file and truncates it to at most block_size tokens.
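For illustration, this is roughly the tokenizer call that block_size feeds into (a simplified sketch, not the exact internal code):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
lines = ["A first training sentence.", "A second, somewhat longer training sentence."]

# block_size plays the role of max_length here: each line is tokenized,
# special tokens are added, and anything beyond block_size tokens is cut off.
batch = tokenizer(
    lines,
    add_special_tokens=True,
    truncation=True,
    max_length=128,  # i.e. block_size
)
print([len(ids) for ids in batch["input_ids"]])
```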