I pre-trained a language model on my own data and I want to continue the pre-training for additional steps using the last checkpoint. I am planning to use the code below to continue the pre-training, but I want to be sure that everything is correct before starting.
Let’s say that I saved all of my files into CRoBERTa.
The Trainer will load the last checkpoint it can find in your output directory, so it won’t necessarily be the one you specified. It will also resume training from there for just the number of steps that are left, so the result won’t be any different from the model you got at the end of your initial Trainer.train().
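For reference, a minimal sketch of both ways to call it, assuming trainer is the same Trainer instance used for the original run and its output_dir is CRoBERTa (the checkpoint-100 path is just a hypothetical example):

# Option 1: let the Trainer pick the most recent checkpoint-<step> folder inside output_dir
trainer.train(resume_from_checkpoint=True)

# Option 2: resume from an explicit checkpoint directory (hypothetical path)
trainer.train(resume_from_checkpoint="CRoBERTa/checkpoint-100")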
It does load and train successfully, but when I check my logger (e.g. TensorBoard), the epochs start from 0 every time I train. This is annoying because the curves keep starting from the beginning when they should actually run back-to-back.
Am I doing something wrong, and is there a way to fix this?
I have one more question, which is unrelated and which I couldn’t find an answer to: how do I save model checkpoints in the same format that trainer.train() does?
I know I can use model.save_pretrained('bert-base-uncased'); however, this saves directly into that directory, unlike trainer.train(), which saves into bert-base-uncased/checkpoint-100 …. I want a function that will automatically do this based on the current step count. Does such a function exist?
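I’m not aware of a single public helper for this, but here is a minimal sketch of one way to reproduce the same layout, assuming you already have a trainer instance (the checkpoint-<step> folder name simply mirrors the Trainer’s own convention):

import os

# Save the current model into <output_dir>/checkpoint-<global_step>,
# mimicking the folder layout produced by trainer.train().
step = trainer.state.global_step
checkpoint_dir = os.path.join(trainer.args.output_dir, f"checkpoint-{step}")
trainer.save_model(checkpoint_dir)  # writes the model weights and config there

Note that the Trainer’s own checkpoints also contain the optimizer, scheduler, and trainer state, which save_model does not write, so a folder created this way can be loaded with from_pretrained but may not be fully resumable via resume_from_checkpoint.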
Yes, I ran into the same situation. For example, the loss at checkpoint-100 was 0.20, but when I set resume_from_checkpoint to True, the training still started from step 1 and the loss was still 0.7, just as if I had started from the beginning.
I just downloaded my checkpoint files from “checkpoint510” and uploaded them to another machine (the same service on vast.ai), with the same resume-from-checkpoint setup. But there is still an error saying there is no valid checkpoint. Please, can someone tell me why?
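One possible cause (this is an assumption, not something visible in the post): with resume_from_checkpoint=True, the Trainer looks inside output_dir for folders named checkpoint-<step>, so a renamed or misplaced folder will not be detected as a valid checkpoint. Passing the path explicitly skips that search; the path below is hypothetical:

# Sketch: point the Trainer directly at the uploaded checkpoint folder
# instead of relying on the automatic checkpoint-<step> search in output_dir.
trainer.train(resume_from_checkpoint="./output/checkpoint-510")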
my_training_args = TrainingArguments(
    report_to="none",
    output_dir="./expt2_train_student_gpt",
    num_train_epochs=100,
    save_strategy="epoch",
    …
    …
    push_to_hub=False,
)
# Create the trainer
scratch_trainer = Trainer(
    model=scratch_train_model,
    args=my_training_args,
    …
)
We initially started a 100-epoch training:
scratch_trainer.train()
We can extend it to 200 epochs in the following manner: redefine the trainer as follows and run this cell again (note that it now has the increased number of epochs):

my_training_args = TrainingArguments(
    report_to="none",
    output_dir="./expt2_train_student_gpt",
    num_train_epochs=200,
    save_strategy="epoch",
    …
    …
    push_to_hub=False,
)
# Create the trainer
scratch_trainer = Trainer(
    model=scratch_train_model,
    args=my_training_args,
    …
)
Resume training here:

# We are attempting a resume here
scratch_trainer.train(resume_from_checkpoint=True)
I have found that this starts slightly above the halfway point (say, from the 53rd epoch of the earlier 100-epoch training), and the loss value is quite good: it does not start from the initial loss value but from a lower one, and it eventually converges to the earlier loss level. It then continues training to the new desired number of epochs. In summary, we can recover around 50% of the training, unless there is a major change in the training hyperparameters.
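If you want to confirm which checkpoint the Trainer will actually resume from before calling train(), you can inspect the output directory; a small sketch using the get_last_checkpoint helper from transformers (the output_dir matches the TrainingArguments above):

from transformers.trainer_utils import get_last_checkpoint

# Returns the path of the newest checkpoint-<step> folder in output_dir,
# or None if no valid checkpoint is found there.
last_ckpt = get_last_checkpoint("./expt2_train_student_gpt")
print(last_ckpt)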