No skipping steps after loading from checkpoint

stoffy · June 16, 2021, 7:04am

Hey! I am trying to continue training by loading a checkpoint. But for some reason, it always starts from scratch. Probably I am just missing something.

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_arguments = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy='steps',
    per_device_train_batch_size=training_config['per_device_train_batch_size'],
    per_device_eval_batch_size=training_config['per_device_eval_batch_size'],
    fp16=True,
    output_dir=training_output_path,
    overwrite_output_dir=True,
    logging_steps=training_config['logging_steps'],
    save_steps=training_config['save_steps'],
    eval_steps=training_config['eval_steps'],
    warmup_steps=training_config['warmup_steps'],
    metric_for_best_model='eval_loss',
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

Here are the logs:

loading weights file .../models/checkpoint-2000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.
***** Running training *****
  Num examples = 222862
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83574

I am missing something like:

Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 48000
Continuing training from 0 non-embedding floating-point operations
Will skip the first 48000 steps in the first epoch

Which I found here: Load from checkpoint not skipping steps - Transformers - Hugging Face Forums

Maybe somebody can help me? Thank you in advance!

sgugger · June 16, 2021, 12:59pm

With overwrite_output_dir=True you reset the output dir of your Trainer, which deletes the checkpoints. If you remove that option, it should resume from the latest checkpoint.

stoffy · June 17, 2021, 5:23am

Thanks for your fast response. Unfortunately, it is still not working. I have set overwrite_output_dir=False but the outcome is the same:

loading weights file /content/drive/MyDrive/output/training/roberta/checkpoint-59000/pytorch_model.bin
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/output/training/roberta/checkpoint-59000.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
PyTorch: setting up devices
Using amp fp16 backend
***** Running training *****
  Num examples = 222862
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83574
  0% 50/83574 [00:20<9:30:55,  2.44it/s]

Probably I don’t understand something here. When resuming, I pick the checkpoint path as the model path. That’s correct, right?

I am a bit confused by the documentation:

overwrite_output_dir ( bool , optional, defaults to False ) – If True , overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

Since I point to a checkpoint directory this should be set to True, right?

Sorry for so many questions. This is all very new to me.
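
For reference, the step counts in the logs above can be checked by hand. The sketch below assumes the Trainer gets the steps per epoch by ceiling division of the number of examples by the total batch size, and that a resumed run skips global_step modulo that number within the current epoch. The function name resume_position is made up for illustration.

```python
import math

def resume_position(num_examples, total_batch_size, num_epochs, global_step):
    """Return (total_steps, epochs_done, steps_to_skip) for a resumed run."""
    # Assumption: the last, partial batch of an epoch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / total_batch_size)
    total_steps = steps_per_epoch * num_epochs
    # Whole epochs already finished, and steps already done in the current one.
    epochs_done, steps_to_skip = divmod(global_step, steps_per_epoch)
    return total_steps, epochs_done, steps_to_skip

# Values from the logs above: 222862 examples, batch size 8, 3 epochs,
# resuming from checkpoint-59000.
print(resume_position(222862, 8, 3, 59000))  # (83574, 2, 3284)
```

Under those assumptions, a run that really resumed from checkpoint-59000 would continue in epoch 2 after skipping 3284 steps, instead of showing a progress bar that starts at 0 of 83574.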

sgugger · June 17, 2021, 11:36am

Oh, the documentation is outdated. You shouldn’t load your model from the checkpoint directory anymore: as long as the checkpoint is in the output_dir, the Trainer will use it if you call trainer.train(resume_from_checkpoint=True).

You can also pass the folder to your exact checkpoint instead of True.
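
A minimal sketch of the two options, assuming checkpoints are saved as checkpoint-<step> folders inside output_dir, as in the logs earlier in the thread. The helper latest_checkpoint is hypothetical and only approximates what resume_from_checkpoint=True looks for. The trainer.train(...) calls are left as comments because they need the trainer object from the first post.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare steps as integers so checkpoint-59000 beats checkpoint-9000.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])

# Option 1: let the Trainer pick the last checkpoint in output_dir.
# trainer.train(resume_from_checkpoint=True)
#
# Option 2: pass the folder of one specific checkpoint.
# trainer.train(resume_from_checkpoint=latest_checkpoint(training_output_path))
```

In both cases the model is created the same way as for the original run, and the output_dir has to be the one that holds the checkpoints.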

stoffy · June 17, 2021, 6:30pm

Thanks a lot. It works like a charm!

ThomasG · October 1, 2021, 11:40am

Hello, can you elaborate on what you mean by saying “you shouldn’t use your model from the checkpoint directory anymore”? If I want to continue training from my last checkpoint, how should I load my model using from_pretrained()?

Should I not pass the path to the checkpoint inside the from_pretrained() method?

Thanks in advance.

sgugger · October 1, 2021, 11:57am

No, you should just do:

trainer.train(resume_from_checkpoint=True)

as I said earlier. It will load the weights in your model.

ThomasG · October 1, 2021, 12:10pm

Thanks for the very fast reply. I don’t understand how I must load my model in this case? I am using the Wav2Vec2ForCTC model, and I’ve fine-tuned it for 27 epochs, while initially I asked it to train for 100 epochs. But I had to stop it, and I have the checkpoint of the 27th epoch.

Am I supposed to load it using the initial model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-large-xlsr-53') ?

If I do this, and then re-define the trainer & training arguments, then start training using the code you provided, will it over-write the pretrained loaded weights and use my saved ones?

Thanks!

sgugger · October 1, 2021, 12:13pm

You should use your model the same way you did before, as I said the trainer will load the weights from the last checkpoint.

ThomasG · October 1, 2021, 12:23pm

Thank you. Please forgive my ignorance, but just to make sure I understand everything correctly, the steps are as follows:

Load the model (using the typical from_pretrained I showed above) and the trainer/arguments as before.
If the output_dir of my arguments contains the checkpoint, use overwrite_output_dir=True.
Use trainer.train(resume_from_checkpoint=True)

This will continue training the model for the remainder of the epochs defined in my arguments, and will load the weights of my 27th epoch.

Does everything sound correct?

sgugger · October 1, 2021, 1:06pm

No, you should absolutely not use overwrite_output_dir, as it will overwrite your checkpoints.

ThomasG · October 1, 2021, 1:10pm

This is from the documentation. Isn’t this what I am doing? When should I set this to True? I am sorry for the many questions, it’s just that I am not sure what is the proper order to continue from a checkpoint, as there are many different choices to take into account.

In particular, when I re-define my trainer and training arguments, if my output_dir is the one containing the checkpoint, shouldn’t I set the overwrite to True? Can I even set a completely different output_dir, that does not contain my final checkpoint?

sgugger · October 1, 2021, 1:25pm

The documentation is wrong in this case (it is very old, so I’m guessing we forgot to update it). In any case:

overwrite_output_dir is only used in the example scripts, not the Trainer class itself, so its value is irrelevant if you are not using an example script
when using an example script it needs to be set to False to resume from a checkpoint.
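
The check an example script makes can be paraphrased roughly as follows. This is a sketch written from the description above, not the scripts' actual code. decide_start and its parameters are hypothetical names.

```python
def decide_start(dir_is_empty, overwrite_output_dir, last_checkpoint):
    """Decide how an example-script-style run should start.

    last_checkpoint is the path of the newest checkpoint found in
    output_dir, or None if there is none.
    """
    if overwrite_output_dir:
        # Existing checkpoints are not looked at: training starts over.
        return "start from scratch"
    if last_checkpoint is not None:
        return f"resume from {last_checkpoint}"
    if not dir_is_empty:
        # Non-empty folder without a checkpoint: refuse rather than clobber it.
        raise ValueError("output_dir is not empty; set overwrite_output_dir to overwrite it")
    return "start from scratch"
```

With overwrite_output_dir=True the checkpoint is never considered, which is why the flag has to be False for a script to resume.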

ThomasG · October 1, 2021, 1:34pm

I see. Thanks a lot @sgugger. In any case before I attempt anything, I will create a copy of my checkpoint just to be safe!

Have a great day!
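
Such a backup can be made with the standard library alone. A minimal sketch: backup_checkpoint is an illustrative helper, not something the Trainer provides.

```python
import shutil
from pathlib import Path

def backup_checkpoint(checkpoint_dir, backup_root):
    """Copy a checkpoint folder into backup_root and return the copy's path."""
    src = Path(checkpoint_dir)
    dst = Path(backup_root) / src.name
    if dst.exists():
        # Never silently replace an earlier backup.
        raise FileExistsError(f"{dst} already exists; refusing to overwrite a backup")
    shutil.copytree(src, dst)
    return dst
```

Keeping backup_root outside the training output_dir means nothing that writes to output_dir can touch the copy.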

viv-om · February 24, 2022, 10:45am

@sgugger I am using trainer.train(resume_from_checkpoint=True) to train the model from the last checkpoint, but it starts from the beginning. I can see the checkpoints saved in the correct folder. I did earlier have overwrite_output_dir=True in my training args. I have removed it now, but to no avail.

Can you please comment on what could be going wrong here?

lucadini · March 3, 2022, 9:37am

Maybe overwrite_output_dir=True deleted all your checkpoints. Check the output directory! If that happened, you have to repeat the first part of the training and then run the second part, always keeping overwrite_output_dir=False.
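
One way to check the output directory is to list the checkpoint folders and read the step recorded in each. The sketch assumes every checkpoint folder contains a trainer_state.json file with a global_step field, which is an assumption about the Trainer's checkpoint layout. list_checkpoints is a hypothetical helper.

```python
import json
from pathlib import Path

def list_checkpoints(output_dir):
    """Return [(folder_name, global_step or None)] for checkpoint-* folders."""
    found = []
    for folder in sorted(Path(output_dir).glob("checkpoint-*")):
        if not folder.is_dir():
            continue
        state_file = folder / "trainer_state.json"
        step = None
        if state_file.is_file():
            # Assumed field name: global_step.
            step = json.loads(state_file.read_text()).get("global_step")
        found.append((folder.name, step))
    return found
```

An empty list means there is nothing to resume from. A folder whose step is None has no readable trainer state and probably cannot be resumed.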

dashapyly · April 21, 2022, 3:52pm

Actually, I have restarted everything, and now everything works fine and datapoints are being skipped - ignore me please. Thanks!
