Continuing Pre Training from Model Checkpoint

Hi,

I pre-trained a language model on my own data and I want to continue pre-training for additional steps from the last checkpoint. I am planning to use the code below, but I want to be sure that everything is correct before starting.

Let’s say that I saved all of my files into CRoBERTa.

model = RobertaForMaskedLM.from_pretrained('CRoBERTa/checkpoint-…')
tokenizer = RobertaTokenizerFast.from_pretrained('CRoBERTa', max_len = 512, padding = 'longest')

training_args = TrainingArguments(overwrite_output_dir = False, …)
trainer = Trainer(…)

trainer.train(resume_from_checkpoint = True)

Is this pipeline correct? Is there anything I am missing?

If you use

trainer.train(resume_from_checkpoint = True)

the Trainer will load the last checkpoint it can find in the output directory, which is not necessarily the one you specified. It will also resume training from there with only the steps that are left, so if the initial run already finished, the result won’t be any different from the model you got at the end of your initial Trainer.train.
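To make "just the number of steps left" concrete, here is a minimal sketch. It assumes the checkpoint folder contains the trainer_state.json file that the Trainer writes, with a global_step field. The helper name steps_left and its planned_max_steps parameter are hypothetical and only for illustration.

```python
import json
from pathlib import Path


def steps_left(checkpoint_dir, planned_max_steps):
    """Return how many optimizer steps remain after resuming.

    Hypothetical helper (not part of transformers): reads `global_step`
    from the trainer_state.json inside the checkpoint folder and subtracts
    it from the total step count planned for the resumed run.
    """
    state_file = Path(checkpoint_dir) / "trainer_state.json"
    state = json.loads(state_file.read_text())
    # Never report a negative number of remaining steps.
    return max(planned_max_steps - state["global_step"], 0)
```

If the checkpoint was written at the final step of the original run and the planned total is unchanged, this returns 0, which matches the behaviour described above. Raising max_steps or num_train_epochs in TrainingArguments is what would leave additional steps to train.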

So in this case I don’t need to specify the checkpoint when loading the pre-trained model, and the rest is good to go, right?

OK, I’ve tried

trainer.train(resume_from_checkpoint = True)

and it does load and train successfully, but when I check my logger (e.g. TensorBoard), the epochs start from 0 every time I train. This is annoying because the curves keep starting from the beginning when they should actually be back-to-back.

Am I doing something wrong? And is there a way to fix this?

I have one more question, which is unrelated and for which I found no answer: how do I save model checkpoints in the same format as trainer.train() does?

I know I can use model.save_pretrained('bert-base-uncased'), but this saves directly into that directory, unlike trainer.train(), which saves into bert-base-uncased/checkpoint-100 … I want a function that does this automatically based on the current step count. Does such a function exist?
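One way to approximate that folder layout by hand is sketched below. The helper save_step_checkpoint is hypothetical, not a transformers function. It only assumes the model object has a save_pretrained(path) method, and that the step number is supplied by the caller (for example from trainer.state.global_step).

```python
from pathlib import Path


def save_step_checkpoint(model, output_dir, step):
    """Save `model` under <output_dir>/checkpoint-<step> and return the path.

    Hypothetical helper: `model` is any object exposing save_pretrained(path);
    `step` is the current training step chosen by the caller.
    """
    target = Path(output_dir) / f"checkpoint-{step}"
    target.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(target))
    return target
```

Note that this only writes whatever save_pretrained writes. A checkpoint produced by the Trainer also holds optimizer, scheduler, and trainer state, so a folder created this way may not be enough for resume_from_checkpoint. Calling trainer.save_model(path) with the same kind of path is another option for saving just the model.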

I know that we can continue training with this:

trainer.train("checkpoint-9500")

But suppose I wanted to change part of the model (for example, the embedding size). How can I do that? Is it possible?

Were you able to solve it?

Yes, I ran into the same situation. For example, the loss at checkpoint-100 was 0.20, but when I set resume_from_checkpoint to True, training still starts from step 1 and the loss is back at 0.7, just as if I were starting from the beginning.

I just downloaded my checkpoint files from “checkpoint510” and uploaded them to another machine (same service on vast.ai), with the same resume-from-checkpoint setup. But I still get a “no valid checkpoint” error. Could someone please tell me why?
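One thing worth checking for a "no valid checkpoint" error is the folder layout on the new machine. The sketch below assumes the Trainer looks inside output_dir for sub-folders named checkpoint-<step> and picks the highest step. transformers ships a similar utility, transformers.trainer_utils.get_last_checkpoint; the function here is a standard-library stand-in for illustration, not the library's code.

```python
import re
from pathlib import Path

# Assumed naming pattern: "checkpoint-" followed by the step number.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the checkpoint folder with the highest step, or None."""
    candidates = []
    for entry in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(entry.name)
        if entry.is_dir() and match:
            # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Under this assumption, a folder named without the hyphen, or checkpoint files copied straight into output_dir rather than into a checkpoint-<step> sub-folder, would not be found, and the function returns None.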

I am able to train from the best checkpoint, but the model does not write any new checkpoints after training completes. Why?

I ran into a mismatch in loss between using torch.load and using resume_from_checkpoint. My detailed problem description is on Stack Overflow: “Mismatch in Loss When Using torch.load and resume_from_checkpoint”. I look forward to your reply.