Cannot Resume Training

I’m trying to resume training from a checkpoint with RobertaForMaskedLM.

I’m using the same script I trained with, except that at the last stage I call trainer.train("checkpoint-200000"); the model is created from the config as usual and the tokenizer is loaded from disk.
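For context, the resume stage looks roughly like this (a simplified sketch: the paths, config values, and dataset file are placeholders for my actual setup, and the rest of the script is unchanged from the original run):

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Tokenizer is loaded from disk, model is freshly created from config
# ("./tokenizer" and the config values are placeholders).
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer")
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config=config)

# Same dataset and MLM collator as the original run (file path is a placeholder).
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./train.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=50,
    per_device_train_batch_size=256,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Point train() at the checkpoint directory to resume.
trainer.train("checkpoint-200000")
```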

The trainer prints the messages below, but the loss isn’t consistent with the original training. At this point in the original run the loss was around 0.6; on resume it is first reported as 0.0004 and then jumps to 7.85.

This makes me distrust reloading the trained model, since I have no confidence it was trained and serialised correctly. What could I be doing wrong?

***** Running training *****
Num examples = 1061833
Num Epochs = 50
Instantaneous batch size per device = 256
Total train batch size (w. parallel, distributed & accumulation) = 256
Gradient Accumulation steps = 1
Total optimization steps = 207400
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 48
Continuing training from global step 200000
Will skip the first 896 batches in the first epoch

{'loss': 0.00040710282906264845, 'learning_rate': 1.7815814850530375e-06, 'epoch': 48.21841851494696}
{'loss': 7.853060150146485, 'learning_rate': 1.7791706846673097e-06, 'epoch': 48.22082931533269}
{'loss': 7.491885375976563, 'learning_rate': 1.7767598842815817e-06, 'epoch': 48.22324011571842}


The running loss you were at isn’t saved, so you can’t trust what is reported when training resumes. First, it isn’t an average of all the losses since the beginning of training, only of those since you restarted. Second, the first time it’s logged it’s divided by the wrong number (one that is far too big), which is why you see that very low value.
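To make the second point concrete, here is a rough back-of-the-envelope illustration (the logging interval and the exact divisor are assumptions for illustration, not the real Trainer internals):

```python
# Hypothetical numbers, for illustration only.
avg_loss_per_step = 0.6        # roughly where the original run was
logging_steps = 500            # assumed logging interval
resumed_global_step = 200_000  # global step restored from the checkpoint

# In a normal run, the loss accumulated since the last log is divided by the
# number of steps in that window, so the report matches the true average:
normal_report = (avg_loss_per_step * logging_steps) / logging_steps
print(normal_report)  # ~0.6

# Right after a resume, only a logging window's worth of batches has fed the
# accumulator, but the divisor is based on the restored global step, so the
# first report comes out orders of magnitude too small:
first_report_after_resume = (avg_loss_per_step * logging_steps) / resumed_global_step
print(first_report_after_resume)  # ~0.0015
```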

I’ll look at this tomorrow and see if we can have the same losses printed in a full training and a resumed training.
