Further pre-train RoBERTa model

I have gone through this code for training from scratch and understood how to pre-train a model from scratch. I have the following doubts about this code:

  • What does block_size in LineByLineTextDataset represent?

  • If I want to further pre-train the roberta-base model (instead of training from scratch) using my own corpus, what changes do I have to make in the above code besides the following ones:

from transformers import RobertaForMaskedLM, RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

I am aware that I do not need to train the tokenizer from scratch.
@thomwolf @julien-c


Hi @mr-nlp, I think you can use the same run_language_modeling.py script to further pre-train RoBERTa; just provide your own datasets.
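If you'd rather do it in your own script or notebook instead of run_language_modeling.py, here's a minimal sketch of further pre-training roberta-base with the Trainer API. The corpus path "my_corpus.txt", the output directory, and the hyperparameters are placeholders, not recommendations:

```python
# Minimal sketch: further pre-train roberta-base on your own corpus with MLM.
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse the pretrained tokenizer and weights instead of training from scratch.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Each non-empty line of the corpus file becomes one training example,
# truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",  # placeholder: path to your own corpus
    block_size=128,
)

# Dynamic masking for the masked-language-modeling objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-further-pretrained",  # placeholder
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("./roberta-further-pretrained")
```

The only real difference from the train-from-scratch setup is that both the tokenizer and the model are loaded with from_pretrained("roberta-base") rather than built from a fresh config.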

block_size is used as the tokenizer's max_length: LineByLineTextDataset tokenizes each line of the file and truncates it to at most block_size tokens.
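For illustration, this is roughly the tokenizer call that block_size feeds into (a simplified sketch, not the exact internal code):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
lines = ["A first training sentence.", "A second, somewhat longer training sentence."]

# block_size plays the role of max_length here: each line is tokenized,
# special tokens are added, and anything beyond block_size tokens is cut off.
batch = tokenizer(
    lines,
    add_special_tokens=True,
    truncation=True,
    max_length=128,  # i.e. block_size
)
print([len(ids) for ids in batch["input_ids"]])
```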