No skipping steps after loading from checkpoint

Hey! I am trying to continue training by loading a checkpoint. But for some reason, it always starts from scratch. Probably I am just missing something.

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_arguments = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy='steps',
    per_device_train_batch_size=training_config['per_device_train_batch_size'],
    per_device_eval_batch_size=training_config['per_device_eval_batch_size'],
    fp16=True,
    output_dir=training_output_path,
    overwrite_output_dir=True,
    logging_steps=training_config['logging_steps'],
    save_steps=training_config['save_steps'],
    eval_steps=training_config['eval_steps'],
    warmup_steps=training_config['warmup_steps'],
    metric_for_best_model='eval_loss',
    greater_is_better=False)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

Here are the logs:

loading weights file .../models/checkpoint-2000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.
***** Running training *****
  Num examples = 222862
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83574

I am missing log lines like these:

Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 48000
Continuing training from 0 non-embedding floating-point operations
Will skip the first 48000 steps in the first epoch

I found these lines here: Load from checkpoint not skipping steps - 🤗 Transformers - Hugging Face Forums

Maybe somebody can help me? Thank you in advance!

With overwrite_output_dir=True you reset the output dir of your Trainer, which deletes the checkpoints. If you remove that option, it should resume from the latest checkpoint.
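
For example, a minimal sketch of your arguments with that flag simply removed (keeping your other settings as they are, so overwrite_output_dir falls back to its default of False and the existing checkpoint-* folders in output_dir are kept):

training_arguments = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy='steps',
    fp16=True,
    output_dir=training_output_path,  # same directory that already contains the checkpoints
    save_steps=training_config['save_steps'],
    eval_steps=training_config['eval_steps'],
    metric_for_best_model='eval_loss',
    greater_is_better=False)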

Thanks for your fast response. Unfortunately, it is still not working. I have set overwrite_output_dir=False but the outcome is the same:

loading weights file /content/drive/MyDrive/output/training/roberta/checkpoint-59000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/output/training/roberta/checkpoint-59000.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
PyTorch: setting up devices
Using amp fp16 backend
***** Running training *****
  Num examples = 222862
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83574
  0% 50/83574 [00:20<9:30:55,  2.44it/s]

Probably I don’t understand something here. When resuming, I use the checkpoint path as the model path. That’s correct, right?
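
Concretely, this is roughly what I do (just a sketch of my setup: I load the checkpoint folder as the model path and then call trainer.train() with no arguments):

from transformers import EncoderDecoderModel

# load the weights from the checkpoint folder, using it as the model path
model = EncoderDecoderModel.from_pretrained(
    '/content/drive/MyDrive/output/training/roberta/checkpoint-59000')

# start training without passing anything to train()
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()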

I am a bit confused by the documentation:

overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

Since I point to a checkpoint directory, this should be set to True, right?

Sorry for so many questions. This is all very new to me.

Oh, the documentation is outdated; you shouldn’t load your model from the checkpoint directory anymore. As long as the checkpoint is in the output_dir, the Trainer will use it if you call trainer.train(resume_from_checkpoint=True).

You can also pass the path to your exact checkpoint folder instead of True.
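
So, concretely:

# resume from the most recent checkpoint found in output_dir
trainer.train(resume_from_checkpoint=True)

# or point to one specific checkpoint folder
trainer.train(resume_from_checkpoint='/content/drive/MyDrive/output/training/roberta/checkpoint-59000')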


Thanks a lot. It works like a charm!