Cannot Resume Training

I’m trying to resume training from a checkpoint with RobertaForMaskedLM.

I’m using the same script I trained with, except that at the last stage I call trainer.train("checkpoint-200000"); the model is created from the config as usual and the tokenizer is loaded from disk.
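For context, the resume stage looks roughly like this (a simplified sketch: the paths, config values, and dataset file are placeholders for my actual setup, and the rest of the script is unchanged from the original run):

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Tokenizer is loaded from disk, model is freshly created from config
# ("./tokenizer" and the config values are placeholders).
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer")
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config=config)

# Same dataset and MLM collator as the original run (file path is a placeholder).
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./train.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=50,
    per_device_train_batch_size=256,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Point train() at the checkpoint directory to resume.
trainer.train("checkpoint-200000")
```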

The trainer prints the messages below, but the loss isn’t consistent with the original training. At this point in the original run the loss was around 0.6; on resume it is first reported as 0.0004 and then jumps to 7.85.

This makes me distrust reloading the trained model, since I have no confidence it was trained and serialised correctly. What could I be doing wrong?

***** Running training *****
Num examples = 1061833
Num Epochs = 50
Instantaneous batch size per device = 256
Total train batch size (w. parallel, distributed & accumulation) = 256
Gradient Accumulation steps = 1
Total optimization steps = 207400
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 48
Continuing training from global step 200000
Will skip the first 896 batches in the first epoch

{'loss': 0.00040710282906264845, 'learning_rate': 1.7815814850530375e-06, 'epoch': 48.21841851494696}
{'loss': 7.853060150146485, 'learning_rate': 1.7791706846673097e-06, 'epoch': 48.22082931533269}
{'loss': 7.491885375976563, 'learning_rate': 1.7767598842815817e-06, 'epoch': 48.22324011571842}


The running loss you were at isn’t saved, so you can’t trust what is reported when training resumes. First, it isn’t an average of all the losses since the beginning of training, only of those since you restarted. Second, the first time it’s logged it’s divided by the wrong number (one that is far too big), which is why you see that very low value.
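To make the second point concrete, here is a rough back-of-the-envelope illustration (the logging interval and the exact divisor are assumptions for illustration, not the real Trainer internals):

```python
# Hypothetical numbers, for illustration only.
avg_loss_per_step = 0.6        # roughly where the original run was
logging_steps = 500            # assumed logging interval
resumed_global_step = 200_000  # global step restored from the checkpoint

# In a normal run, the loss accumulated since the last log is divided by the
# number of steps in that window, so the report matches the true average:
normal_report = (avg_loss_per_step * logging_steps) / logging_steps
print(normal_report)  # ~0.6

# Right after a resume, only a logging window's worth of batches has fed the
# accumulator, but the divisor is based on the restored global step, so the
# first report comes out orders of magnitude too small:
first_report_after_resume = (avg_loss_per_step * logging_steps) / resumed_global_step
print(first_report_after_resume)  # ~0.0015
```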

I’ll look at this tomorrow and see if we can have the same losses printed in a full training and a resumed training.
