Hey! I am trying to continue training by loading a checkpoint, but for some reason it always starts from scratch. I am probably just missing something.
loading weights file .../models/checkpoint-2000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.
***** Running training *****
Num examples = 222862
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 83574
I am missing lines like these:
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 48000
Continuing training from 0 non-embedding floating-point operations
Will skip the first 48000 steps in the first epoch
With overwrite_output_dir=True you reset the output directory of your Trainer, which deletes the checkpoints. If you remove that option, it should resume from the latest checkpoint.
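For reference, a minimal sketch of the arguments side (the output_dir and hyperparameters below are placeholders, not your exact values):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/training",      # directory that already holds checkpoint-2000/
    num_train_epochs=3,
    per_device_train_batch_size=8,
    # overwrite_output_dir is left at its default of False,
    # so existing checkpoint-* folders are kept
)
```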
Thanks for your fast response. Unfortunately, it is still not working. I have set overwrite_output_dir=False but the outcome is the same:
loading weights file /content/drive/MyDrive/output/training/roberta/checkpoint-59000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.
All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/output/training/roberta/checkpoint-59000.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
PyTorch: setting up devices
Using amp fp16 backend
***** Running training *****
Num examples = 222862
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 83574
0% 50/83574 [00:20<9:30:55, 2.44it/s]
I probably don’t understand something here. When resuming, I pass the checkpoint path as the model path. That’s correct, right?
I am a bit confused by the documentation:
overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
Since I point to a checkpoint directory this should be set to True, right?
Sorry for so many questions. This is all very new to me.
Oh, the documentation is outdated: you shouldn’t load your model from the checkpoint directory anymore. As long as the checkpoint is in the output_dir, the Trainer will use it if you do trainer.train(resume_from_checkpoint=True).
You can also pass the folder to your exact checkpoint instead of True.
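In code, that looks roughly like this (the explicit path is just the checkpoint folder from your log):

```python
# Resume from the most recent checkpoint found in training_args.output_dir:
trainer.train(resume_from_checkpoint=True)

# Or point to one specific checkpoint folder explicitly:
trainer.train(
    resume_from_checkpoint="/content/drive/MyDrive/output/training/roberta/checkpoint-59000"
)
```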
Hello, can you elaborate on what you mean by saying “you shouldn’t use your model from the checkpoint directory anymore”? If I want to continue training from my last checkpoint, how should I load my model using from_pretrained()?
Should I not pass the path to the checkpoint inside the from_pretrained() method?
Thanks for the very fast reply. I don’t understand how I should load my model in this case. I am using the Wav2Vec2ForCTC model and have fine-tuned it for 27 of the 100 epochs I originally asked for; I had to stop training, and I have the checkpoint from the 27th epoch.
Am I supposed to load it with the initial model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-large-xlsr-53')?
If I do this, then re-define the trainer and training arguments and start training using the code you provided, will it overwrite the loaded pretrained weights and use my saved ones?
This is from the documentation. Isn’t this what I am doing? When should I set this to True? I am sorry for the many questions; I am just not sure of the proper order of steps to continue from a checkpoint, as there are many different choices to take into account.
In particular, when I re-define my trainer and training arguments, if my output_dir is the one containing the checkpoint, shouldn’t I set overwrite_output_dir to True? Can I even use a completely different output_dir, one that does not contain my final checkpoint?
The documentation is wrong in this case (it is very old, so I’m guessing we forgot to update it). In any case:
- overwrite_output_dir is only used in the example scripts, not the Trainer class itself, so its value is irrelevant if you are not using an example script;
- when using an example script, it needs to be set to False to resume from a checkpoint (roughly as sketched below).
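For reference, the example scripts do roughly the following (simplified, not the exact code from the scripts):

```python
import os
from transformers.trainer_utils import get_last_checkpoint

# Only look for an existing checkpoint when the output directory is kept
# (i.e. overwrite_output_dir is False):
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir:
    last_checkpoint = get_last_checkpoint(training_args.output_dir)

# The detected checkpoint (or None) is then handed to the Trainer:
train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
```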
@sgugger I am using trainer.train(resume_from_checkpoint=True) to train the model from the last checkpoint, but it starts from the beginning. I can see the checkpoints saved in the correct folder. I did earlier have overwrite_output_dir=True in my training args; I have removed it now, but to no avail.
Can you please comment on what could be going wrong here?
Maybe overwrite_output_dir=True deleted all your checkpoints; check the output directory! If that happened, you will have to repeat the first part of the training and then run the second part, keeping overwrite_output_dir=False throughout.
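You can quickly check whether any checkpoint survived with something like this (the path is a placeholder for your own output_dir):

```python
from transformers.trainer_utils import get_last_checkpoint

# Prints the newest checkpoint-XXXX folder, or None if none are left:
print(get_last_checkpoint("path/to/output_dir"))
```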