Continuing Pre Training from Model Checkpoint

fozyurt · November 5, 2021, 2:34pm

Hi,

I pre-trained a language model on my own data and want to continue pre-training for additional steps from the last checkpoint. I plan to use the code below, but I want to be sure everything is correct before starting.

Let’s say that I saved all of my files into CRoBERTa.

model = RobertaForMaskedLM.from_pretrained('CRoBERTa/checkpoint-…')
tokenizer = RobertaTokenizerFast.from_pretrained('CRoBERTa', max_len = 512, padding = 'longest')

training_args = TrainingArguments(overwrite_output_dir = False, …)
trainer = Trainer(…)

trainer.train(resume_from_checkpoint = True)

Is this pipeline correct? Is there anything I am missing?

sgugger · November 5, 2021, 2:39pm

If you use

trainer.train(resume_from_checkpoint = True)

the Trainer will load the last checkpoint it can find in the output directory, which is not necessarily the one you specified when loading the model. It will also resume training from there with only the remaining steps, so if the initial run already finished, the result won’t be any different from the model you got at the end of your initial Trainer.train.
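As a rough illustration of what "the last checkpoint it can find" means, the sketch below picks the highest-numbered `checkpoint-<step>` folder inside an output directory. It is a standard-library approximation written for this thread, not the Trainer's own code; the helper name `find_last_checkpoint` is hypothetical, and the folder-name pattern is an assumption modelled on how the Trainer names its checkpoint folders.

```python
import os
import re
from typing import Optional

# Assumed naming pattern: the Trainer writes folders such as "checkpoint-500".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint folder, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare by step number, not alphabetically, so 1000 ranks above 500.
    _, last_name = max(candidates)
    return os.path.join(output_dir, last_name)
```

If a specific checkpoint is wanted rather than the latest one, its path can be passed instead of `True`, for example `trainer.train(resume_from_checkpoint='CRoBERTa/checkpoint-…')`. Under the pattern assumed here, a folder named without the hyphen (such as `checkpoint510`) would not be picked up. The library also ships its own helper, `transformers.trainer_utils.get_last_checkpoint`, which is probably the better choice in real code.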

fozyurt · November 5, 2021, 2:41pm

So in this case I don’t need to specify the checkpoint when loading the pre-trained model, and the rest is good to go, right?

FarisHijazi · August 9, 2022, 2:51pm

OK, I’ve tried

trainer.train(resume_from_checkpoint = True)

and it does load and train successfully, but when I check my logger (e.g. TensorBoard), the epochs start from 0 every time I train. It’s annoying because the curves keep starting from the beginning when they should be back-to-back.

Am I doing something wrong? Is there a way to fix this?

FarisHijazi · August 9, 2022, 2:54pm

I have one more question, which is unrelated and for which I found no answer: how do I save model checkpoints in the same format as trainer.train() does?

I know I can use model.save_pretrained('bert-base-uncased'), but this saves directly into that directory, unlike trainer.train(), which saves into bert-base-uncased/checkpoint-100 … I want a function that does this automatically based on the current step count. Does such a function exist?
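One way to get the same folder layout by hand is to build the `checkpoint-<step>` path yourself before saving. The sketch below is only an illustration: `save_like_trainer` is a hypothetical helper, not a library function, and it assumes the model object exposes `save_pretrained(path)` as Hugging Face models do.

```python
import os


def save_like_trainer(model, output_dir: str, step: int) -> str:
    """Save `model` under output_dir/checkpoint-<step> and return that path.

    `model` is assumed to expose save_pretrained(path).
    """
    checkpoint_dir = os.path.join(output_dir, f"checkpoint-{step}")
    os.makedirs(checkpoint_dir, exist_ok=True)
    model.save_pretrained(checkpoint_dir)
    return checkpoint_dir
```

When a Trainer is in use, the current step is available as `trainer.state.global_step`, so a call might look like `save_like_trainer(model, 'bert-base-uncased', trainer.state.global_step)`. Note that this writes only what `save_pretrained` writes. The Trainer's own checkpoints also hold optimizer and scheduler state and a `trainer_state.json`, so a folder produced this way is most likely not enough for `resume_from_checkpoint`.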

canovich · August 30, 2022, 3:21am

I know that we can continue training with this:

trainer.train("checkpoint-9500")

But suppose I want to change part of the model (for example, the embedding size). How can I do that? Is it possible?

thasan · November 5, 2023, 9:12am

Were you able to solve it?

CyrisHe · December 15, 2023, 8:51am

Yes, I ran into the same situation. For example, the loss at checkpoint-100 was 0.20, but when I set resume_from_checkpoint to True, training still started from step 1 and the loss was back at 0.7, as if I were starting from the beginning.
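When a resumed run appears to restart from step 1, it can help to check what the checkpoint folder itself records before blaming the resume logic. The sketch below reads the `trainer_state.json` file that the Trainer writes into each checkpoint folder. The helper name `read_checkpoint_progress` is hypothetical, and the key names (`global_step`, `epoch`, `log_history`) are assumed to match the installed version.

```python
import json
import os


def read_checkpoint_progress(checkpoint_dir: str) -> dict:
    """Return the step, epoch and last logged loss stored in a checkpoint folder."""
    state_path = os.path.join(checkpoint_dir, "trainer_state.json")
    with open(state_path, encoding="utf-8") as f:
        state = json.load(f)
    # log_history is a list of dicts; training-loss entries carry a "loss" key.
    losses = [entry["loss"] for entry in state.get("log_history", []) if "loss" in entry]
    return {
        "global_step": state.get("global_step"),
        "epoch": state.get("epoch"),
        "last_logged_loss": losses[-1] if losses else None,
    }
```

If the reported `global_step` is the expected one but the resumed run still logs from step 1, the checkpoint being loaded is probably not the one inspected, or the run is not resuming at all.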

stevenaaa1207 · January 28, 2024, 2:12am

I just downloaded my checkpoint files from “checkpoint510” and uploaded them to another machine (same service on vast.ai), with the same resume-from-checkpoint setup. But I still get a “no valid checkpoint” error. Could someone please tell me why?

ghashami · March 2, 2024, 11:19pm

I am able to train from the best checkpoint, but the model does not write new checkpoints after training completes. Why?

k253 · April 16, 2024, 6:57am

I ran into a mismatch in loss when using torch.load and resume_from_checkpoint. Here is my detailed problem description: python - Mismatch in Loss When Using torch.load and resume_from_checkpoint - Stack Overflow. Looking forward to your reply.

dlface · June 4, 2024, 3:20am

I have the same problem. Resuming training works, but both train loss and eval loss increased.

qswadey · January 13, 2025, 11:36am

If our trainer is defined as:

my_training_args = TrainingArguments(
    report_to="none",
    output_dir="./expt2_train_student_gpt",
    num_train_epochs=100,
    save_strategy="epoch",
    …
    …
    push_to_hub=False,
)
# Create the trainer
scratch_trainer = Trainer(
    model=scratch_train_model,
    args=my_training_args,
    …
)

and we initially started a 100-epoch training:

scratch_trainer.train()

We can extend it to 200 epochs in the following manner. Redefine the trainer as follows and run this cell again (note that it now has the increased number of epochs):
my_training_args = TrainingArguments(
    report_to="none",
    output_dir="./expt2_train_student_gpt",
    num_train_epochs=200,
    save_strategy="epoch",
    …
    …
    push_to_hub=False,
)
# Create the trainer
scratch_trainer = Trainer(
    model=scratch_train_model,
    args=my_training_args,
    …
)

Then resume training:

# We are attempting a resume here
scratch_trainer.train(resume_from_checkpoint = True)

I have found that this resumes from slightly past the halfway point (say, from the 53rd epoch of the earlier 100-epoch training), and the loss value is quite good. That is, it does not start from the initial loss value but from a lower one, and eventually converges to the earlier loss level. Training then continues to the newly requested number of epochs. In summary, we can recover around 50% of the training, unless there is a major change in training hyperparameters.
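One thing that may contribute to the loss not simply continuing where it left off: if the learning rate follows a linear decay to zero over the total number of steps (assumed here, and not confirmed for this setup), then raising num_train_epochs changes the learning rate at the resumed step. The sketch below reproduces only the arithmetic of such a schedule. The function name `linear_schedule_factor` and the figure of 10 steps per epoch are made up for illustration.

```python
def linear_schedule_factor(step: int, total_steps: int, warmup_steps: int = 0) -> float:
    """Multiplier applied to the base learning rate under linear warmup and decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical run: 10 optimizer steps per epoch.
steps_per_epoch = 10
resume_step = 100 * steps_per_epoch  # end of the original 100-epoch run

factor_old_plan = linear_schedule_factor(resume_step, 100 * steps_per_epoch)  # 0.0
factor_new_plan = linear_schedule_factor(resume_step, 200 * steps_per_epoch)  # 0.5
```

Under these assumptions the original run ended with a learning rate of zero, while the extended run resumes at half the base learning rate, which is the value the schedule would have at the midpoint of a 200-epoch run. Whether this explains the behaviour described above depends on the scheduler actually in use.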
