I pre-trained a language model on my own data and I want to continue the pre-training for additional steps from the last checkpoint. I plan to use the code below, but I want to make sure everything is correct before starting.
Let’s say that I saved all of my files into a directory called CRoBERTa.
model = RobertaForMaskedLM.from_pretrained('CRoBERTa/checkpoint-…')
tokenizer = RobertaTokenizerFast.from_pretrained('CRoBERTa', max_len=512, padding='longest')
training_args = TrainingArguments(overwrite_output_dir=False, …)
trainer = Trainer(…)
trainer.train(resume_from_checkpoint=True)
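For context, here is a fuller sketch of the complete script I am planning to run. It is only my best guess: the dataset loading (a hypothetical corpus.txt read via the datasets library) and all hyperparameter values are placeholders for my actual setup.

from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained('CRoBERTa')
model = RobertaForMaskedLM.from_pretrained('CRoBERTa/checkpoint-…')

# Hypothetical corpus file; stands in for my actual pre-training data.
dataset = load_dataset('text', data_files={'train': 'corpus.txt'})['train']

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=['text'])

# The MLM collator handles dynamic padding and random token masking.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir='CRoBERTa',            # same directory, so the existing checkpoint can be found
    overwrite_output_dir=False,
    num_train_epochs=1,               # placeholder
    per_device_train_batch_size=16,   # placeholder
    save_steps=10_000,                # placeholder
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train(resume_from_checkpoint=True)

My understanding is that resume_from_checkpoint=True tells the Trainer to look for the latest checkpoint inside output_dir, which is why I point output_dir back at CRoBERTa.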
Is this pipeline correct? Is there anything I am missing?