I’m trying to resume training using a checkpoint with RobertaForMaskedLM
I’m using the same script I trained with, except that at the last step I call trainer.train("checkpoint-200000"), i.e. the model is created as usual from the config and the tokenizer is loaded from disk.
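Roughly, the relevant part of my script looks like the sketch below (the config values, paths, training arguments and dataset are simplified placeholders, not my exact settings):

from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Model is built fresh from config; only the tokenizer comes from disk.
config = RobertaConfig(vocab_size=52_000, max_position_embeddings=514)
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer")
model = RobertaForMaskedLM(config=config)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=50,
    per_device_train_batch_size=256,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,  # built the same way as in the original run
)

# Resume from the saved checkpoint instead of starting from scratch.
trainer.train("checkpoint-200000")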
The trainer prints the messages below, but the loss isn’t consistent with the original training: at this stage in the original run the loss was around 0.6, whereas on resume it first drops to 0.0004 and then jumps to 7.85.
This makes me distrust reloading the trained model, as I have no confidence that I’ve trained and serialised it correctly. What could I be doing wrong?
***** Running training *****
Num examples = 1061833
Num Epochs = 50
Instantaneous batch size per device = 256
Total train batch size (w. parallel, distributed & accumulation) = 256
Gradient Accumulation steps = 1
Total optimization steps = 207400
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 48
Continuing training from global step 200000
Will skip the first 896 batches in the first epoch
{'loss': 0.00040710282906264845, 'learning_rate': 1.7815814850530375e-06, 'epoch': 48.21841851494696}
{'loss': 7.853060150146485, 'learning_rate': 1.7791706846673097e-06, 'epoch': 48.22082931533269}
{'loss': 7.491885375976563, 'learning_rate': 1.7767598842815817e-06, 'epoch': 48.22324011571842}