Loss jumps down when I restart the training from a checkpoint

Hello! I’ve been training roberta-base for many hours with the Hugging Face Trainer. Sometimes I have had to stop and then restart training with trainer.train(resume_from_checkpoint=True) because of issues with the GPUs I’m using. The problem is that every time I restart training, the loss jumps down on both the training and validation sets. See my wandb plot:
[wandb plot: training and validation loss, with a downward jump at each restart]
Has anyone had the same problem, or do you know how to fix it?
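
For reference, here is a minimal, self-contained sketch of the kind of setup I mean. The tiny in-memory dataset and the hyperparameters are placeholders, not my actual settings; only roberta-base and resume_from_checkpoint=True match my run:

```python
import os

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Placeholder data just to make the sketch runnable.
ds = Dataset.from_dict({"text": ["hello world"] * 64}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="out",    # checkpoints are written to out/checkpoint-<step>
    save_steps=10,
    max_steps=30,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)

# resume_from_checkpoint=True picks up from the latest checkpoint in output_dir;
# on a fresh run (no checkpoints yet) start from scratch instead.
has_ckpt = os.path.isdir("out") and any(
    d.startswith("checkpoint-") for d in os.listdir("out")
)
trainer.train(resume_from_checkpoint=has_ckpt)
```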


Hi everyone, I’m new here. I was looking for a solution to this problem and came across this topic; I hope someone can help solve it. Good luck to everyone.


It’s probably because the learning rate decreases over the epochs: when you restart, it is back at its maximum value, which causes the drop in the loss. This happens in many frameworks. I’m not sure about your settings, though.
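
One way to check that hypothesis: by default, each Trainer checkpoint contains a trainer_state.json whose log_history records the learning rate at every logging step, so you can compare the value logged just before the interruption with the first value after resuming. The out/ path below is just a placeholder for your output_dir:

```python
import json
from pathlib import Path

# Pick the latest checkpoint directory, e.g. out/checkpoint-5000.
latest = max(
    Path("out").glob("checkpoint-*"),
    key=lambda p: int(p.name.split("-")[-1]),
)

state = json.loads((latest / "trainer_state.json").read_text())

# Each training log entry carries the learning rate at that step.
for entry in state["log_history"]:
    if "learning_rate" in entry:
        print(entry["step"], entry["learning_rate"])
```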

I don’t think that’s the problem. Look at the learning rate decay: it seems to go to 0 linearly.
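
To illustrate with a toy example: a linear schedule decays to 0, and if the scheduler state is saved and reloaded (which is, as far as I know, what the Trainer does via scheduler.pt when resuming), the decay continues where it left off instead of jumping back to the peak learning rate. The numbers below are arbitrary, not from the run above:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Toy optimizer/scheduler just to show the shape of the schedule.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=5e-5)
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=100)

for step in range(60):
    opt.step()
    sched.step()

print(sched.get_last_lr()[0])  # ~2e-5: decayed linearly from 5e-5 toward 0

# Restoring the scheduler state resumes the decay rather than resetting it.
saved = sched.state_dict()
resumed = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=100)
resumed.load_state_dict(saved)
print(resumed.get_last_lr()[0])  # same ~2e-5, not the initial 5e-5
```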