Continuing Pre-Training from a Model Checkpoint


I pre-trained a language model on my own data and I want to continue the pre-training for additional steps from the last checkpoint. I am planning to use the code below, but I want to be sure everything is correct before starting.

Let’s say that I saved all of my files into CRoBERTa.

model = RobertaForMaskedLM.from_pretrained('CRoBERTa/checkpoint-…')
tokenizer = RobertaTokenizerFast.from_pretrained('CRoBERTa', model_max_length=512)  # padding='longest' is a per-call argument to the tokenizer, not a from_pretrained option

training_args = TrainingArguments(overwrite_output_dir=False, …)
trainer = Trainer(…)

trainer.train(resume_from_checkpoint=True)

Is this pipeline correct? Is there anything I am missing?

If you use

trainer.train(resume_from_checkpoint=True)

the Trainer will load the last checkpoint it can find in the output directory, which won’t necessarily be the one you specified when loading the model. It will also resume training from there for only the number of steps remaining in the original schedule, so the result won’t be any different from the model you got at the end of your initial trainer.train().
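To make the checkpoint-discovery behavior concrete, here is a small sketch using get_last_checkpoint, the helper the Trainer relies on to resolve resume_from_checkpoint=True. The directory layout and step numbers below are made up for illustration:

```python
import os
import tempfile

from transformers.trainer_utils import get_last_checkpoint

# Simulate an output_dir containing several saved checkpoints
# (step numbers are hypothetical).
output_dir = tempfile.mkdtemp()
for step in (100, 500, 250):
    os.makedirs(os.path.join(output_dir, f"checkpoint-{step}"))

# get_last_checkpoint() returns the checkpoint-* folder with the
# highest step number, regardless of which checkpoint you loaded
# the model weights from.
last = get_last_checkpoint(output_dir)
print(os.path.basename(last))  # checkpoint-500
```

If you want to resume from a particular checkpoint rather than the latest one, resume_from_checkpoint also accepts a path string, e.g. trainer.train(resume_from_checkpoint='CRoBERTa/checkpoint-500') (path hypothetical).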


So in this case I don’t need to specify the checkpoint when loading the pre-trained model and the rest is good to go, right?
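One detail worth spelling out: because the resumed run only trains for whatever remains of the original schedule, getting *additional* steps means extending the schedule in the new TrainingArguments. A minimal sketch, assuming the original run used max_steps and that all numbers here are placeholders:

```python
from transformers import TrainingArguments

# Hypothetical numbers: if the original run trained for 10_000 steps,
# max_steps must be raised past that, or the resumed run will finish
# immediately with nothing left to do.
training_args = TrainingArguments(
    output_dir="CRoBERTa",
    overwrite_output_dir=False,
    max_steps=20_000,  # 10_000 already done + 10_000 additional
)
```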