I have gone through this training code end to end and now understand how to pre-train a model from scratch. I have the following doubts about it:
- What does `block_size` in `LineByLineTextDataset` represent? (My current reading of the source is sketched after this list.)
- If I want to further pretrain the roberta-base model (instead of training from scratch) on my own corpus, what changes do I have to make to the above code besides the following? (My full planned script is also sketched below.)
```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# Load the pretrained tokenizer and model instead of building them from fresh configs
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
```
I am aware that I do not need to train the tokenizer from scratch.
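For the first question, my current understanding from skimming the `LineByLineTextDataset` source is that `block_size` is simply the maximum number of tokens kept per example: it is passed to the tokenizer as `max_length`, so each line of the file becomes one example truncated to at most `block_size` tokens. A rough paraphrase of my reading (not the exact library code):

```python
import torch
from torch.utils.data import Dataset

class LineByLineTextDatasetSketch(Dataset):
    """My paraphrase: one example per non-empty line of the input file."""

    def __init__(self, tokenizer, file_path, block_size):
        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f if line.strip()]
        # block_size caps each example at block_size tokens via truncation
        enc = tokenizer(lines, add_special_tokens=True,
                        truncation=True, max_length=block_size)
        self.examples = enc["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)
```

Please correct me if `block_size` does anything more than this.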
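For the second question, here is a minimal sketch of the full setup I am planning. The file name `my_corpus.txt`, the output directory, and all hyperparameters are placeholders I chose, not values from the original code; I picked `block_size=512` since that is roberta-base's maximum sequence length:

```python
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the pretrained checkpoint rather than a fresh config
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# my_corpus.txt is a placeholder for my own file, one sentence per line
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",
    block_size=512,
)

# Standard MLM collator: masks 15% of tokens, as in RoBERTa pretraining
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-further-pretrained",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```

Is this all that changes, or does continued pretraining need anything else (e.g. a particular learning rate schedule)?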
@thomwolf @julien-c