No skipping steps after loading from checkpoint

Hey! I am trying to continue training by loading a checkpoint. But for some reason, it always starts from scratch. Probably I am just missing something.

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_arguments = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy='steps',
    per_device_train_batch_size=training_config['per_device_train_batch_size'],
    per_device_eval_batch_size=training_config['per_device_eval_batch_size'],
    fp16=True,
    output_dir=training_output_path,
    overwrite_output_dir=True,
    logging_steps=training_config['logging_steps'],
    save_steps=training_config['save_steps'],
    eval_steps=training_config['eval_steps'],
    warmup_steps=training_config['warmup_steps'],
    metric_for_best_model='eval_loss',
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

Here are the logs:

loading weights file .../models/checkpoint-2000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.
***** Running training *****
  Num examples = 222862
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83574

I am missing something like:

Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 48000
Continuing training from 0 non-embedding floating-point operations
Will skip the first 48000 steps in the first epoch

Which I found here: Load from checkpoint not skipping steps - Transformers - Hugging Face Forums
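
For anyone comparing such log lines against their own run, the numbers can be sanity-checked with a bit of arithmetic. The sketch below is not the Trainer's code. It assumes a single device, that the last partial batch is kept, and that the number in a checkpoint-<step> folder name is the global optimizer step. The helper name resume_position is hypothetical.

```python
import math


def resume_position(num_examples, batch_size, grad_accum_steps, num_epochs, global_step):
    """Estimate where a resumed run would pick up (illustrative arithmetic only)."""
    # Batches per epoch, keeping the final partial batch (assumption: no drop_last).
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Optimizer updates per epoch (at least one).
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = updates_per_epoch * num_epochs
    # Whole epochs already finished, and updates already done inside the current epoch.
    epochs_done = global_step // updates_per_epoch
    updates_into_epoch = global_step % updates_per_epoch
    return {
        "total_steps": total_steps,
        "epochs_done": epochs_done,
        "batches_to_skip": updates_into_epoch * grad_accum_steps,
    }


# Values from the log above: 222862 examples, batch size 8, 3 epochs, checkpoint-2000.
pos = resume_position(222862, 8, 1, 3, 2000)
print(pos)
```

With these inputs the sketch gives total_steps = 83574, which matches the "Total optimization steps" line in the log, along with 0 completed epochs and 2000 batches to skip. A run that really resumed would be expected to report something along those lines instead of starting at step 0.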

Maybe somebody can help me? Thank you in advance!

With overwrite_output_dir=True you reset the output dir of your Trainer, which deletes the checkpoints. If you remove that option, it should resume from the latest checkpoint.

Thanks for your fast response. Unfortunately, it is still not working. I have set overwrite_output_dir=False but the outcome is the same:

loading weights file /content/drive/MyDrive/output/training/roberta/checkpoint-59000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/output/training/roberta/checkpoint-59000.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
PyTorch: setting up devices
Using amp fp16 backend
***** Running training *****
  Num examples = 222862
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83574
  0% 50/83574 [00:20<9:30:55,  2.44it/s]

I probably don’t understand something here. When resuming, I pick the checkpoint path as the model path. That’s correct, right?

I am a bit confused by the documentation:

overwrite_output_dir ( bool , optional, defaults to False ) – If True , overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

Since I point to a checkpoint directory this should be set to True, right?

Sorry for so many questions. This is all very new to me.

Oh, the documentation is outdated. You shouldn’t load your model from the checkpoint directory anymore: as long as the checkpoint is in the output_dir, the Trainer will use it if you call trainer.train(resume_from_checkpoint=True).

You can also pass the folder to your exact checkpoint instead of True.
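
To see which folder resume_from_checkpoint=True would likely pick up, it can help to list the checkpoint-<step> folders in output_dir yourself. The helper below is a standard-library sketch, not the Trainer's implementation (transformers ships its own get_last_checkpoint in transformers.trainer_utils). The name find_last_checkpoint is hypothetical, and the sketch assumes checkpoints are folders named checkpoint-<step>, as in the logs above.

```python
import os
import re

# Assumption: checkpoint folders are named "checkpoint-<global_step>".
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint folder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        # Only count real directories whose name matches the pattern.
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    # Compare steps numerically so checkpoint-59000 beats checkpoint-9000.
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")
```

If this returns None, there is nothing in output_dir to resume from. Otherwise the returned path can be passed explicitly, for example trainer.train(resume_from_checkpoint=find_last_checkpoint(training_output_path)), where trainer and training_output_path are the objects from the first post.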

Thanks a lot. It works like a charm!

Hello, can you elaborate on what you mean by “you shouldn’t use your model from the checkpoint directory anymore”? If I want to continue training from my last checkpoint, how should I load my model using from_pretrained()?

Should I not pass the path to the checkpoint inside the from_pretrained() method?

Thanks in advance.

No, you should just do:

trainer.train(resume_from_checkpoint=True)

as I said earlier. It will load the weights in your model.

Thanks for the very fast reply. I don’t understand how I must load my model in this case? I am using the Wav2Vec2ForCTC model, and I’ve fine-tuned it for 27 epochs, while initially I asked it to train for 100 epochs. But I had to stop it, and I have the checkpoint of the 27th epoch.

Am I supposed to load it using the initial model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-large-xlsr-53') ?

If I do this, and then re-define the trainer & training arguments, then start training using the code you provided, will it over-write the pretrained loaded weights and use my saved ones?

Thanks!

You should use your model the same way you did before, as I said the trainer will load the weights from the last checkpoint.

Thank you. Please forgive my ignorance, but just to make sure I understand everything correctly, the steps are as follows:

  1. Load the model (using the typical from_pretrained I showed above) and the trainer/arguments as before.
  2. If the output_dir of my arguments contains the checkpoint, use overwrite_output_dir=True.
  3. Use trainer.train(resume_from_checkpoint=True)

This will continue training the model for the remainder of the epochs defined in my arguments, and will load the weights of my 27th epoch.

Does everything sound correct?

No, you should absolutely not use overwrite_output_dir, as it will overwrite your checkpoints.

The documentation says to use overwrite_output_dir “to continue training if output_dir points to a checkpoint directory.” Isn’t this what I am doing? When should I set this to True? I am sorry for the many questions; it’s just that I am not sure what the proper order is to continue from a checkpoint, as there are many different choices to take into account.

In particular, when I re-define my trainer and training arguments, if my output_dir is the one containing the checkpoint, shouldn’t I set the overwrite to True? Can I even set a completely different output_dir, that does not contain my final checkpoint?

The documentation is wrong in this case (it is very old, so I’m guessing we forgot to update it). In any case:

  • overwrite_output_dir is only used in the example scripts, not the Trainer class itself, so its value is irrelevant if you are not using an example script
  • when using an example script it needs to be set to False to resume from a checkpoint.
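
A rough sketch of what such a check in an example script could look like is given below. It is a paraphrase under stated assumptions rather than the scripts' actual code: decide_resume and its arguments are hypothetical names, and last_checkpoint stands for whatever checkpoint folder was found in output_dir (or None if there is none).

```python
def decide_resume(output_dir_is_nonempty, last_checkpoint, overwrite_output_dir):
    """Return the checkpoint to resume from, or None to start from scratch."""
    if overwrite_output_dir:
        # The flag tells the script to ignore whatever is already in output_dir.
        return None
    if last_checkpoint is not None:
        # A checkpoint was found and overwriting is off: resume from it.
        return last_checkpoint
    if output_dir_is_nonempty:
        # Files exist but none is a checkpoint: refuse rather than clobber them.
        raise ValueError("Output directory is not empty and holds no checkpoint.")
    return None


# Hypothetical paths, for illustration only.
print(decide_resume(True, "output/checkpoint-59000", overwrite_output_dir=False))
print(decide_resume(True, "output/checkpoint-59000", overwrite_output_dir=True))
```

In this sketch the first call returns the checkpoint path and the second returns None, which is the "starts from scratch" behaviour described in this thread when the flag is left at True.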

I see. Thanks a lot @sgugger. In any case, before I attempt anything, I will create a copy of my checkpoint just to be safe!

Have a great day!

@sgugger I am using trainer.train(resume_from_checkpoint=True) to train the model from the last checkpoint, but it starts from the beginning. I can see the checkpoints saved in the correct folder. I did earlier have overwrite_output_dir=True in my training args. I have removed it now, but to no avail.

Can you please comment on what could be going wrong here?

Maybe overwrite_output_dir=True deleted all your checkpoints; check the output directory! If this happened, you have to repeat the first part of the training and then run the second part, always keeping overwrite_output_dir=False.

Actually, I have restarted everything, and now everything works fine and datapoints are being skipped. Please ignore me. Thanks!