Loss jumps down when I restart the training from a checkpoint

Hello! I’ve been training roberta-base for many hours with the Hugging Face Trainer. Sometimes I have had to stop and then restart training with trainer.train(resume_from_checkpoint=True) because of issues with the GPUs I’m using. The problem is that every time I restart training, the loss jumps down on both the training and validation sets. See my wandb plot:
[wandb plot: training and validation loss, with a downward jump at each restart]
Has anyone had the same problem, or do you know how to fix it?
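
For reference, here is a minimal, self-contained sketch of the kind of setup I mean. The tiny in-memory dataset and the hyperparameters are placeholders, not my actual settings; only roberta-base and resume_from_checkpoint=True match my run:

```python
import os

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Placeholder data just to make the sketch runnable.
ds = Dataset.from_dict({"text": ["hello world"] * 64}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="out",    # checkpoints are written to out/checkpoint-<step>
    save_steps=10,
    max_steps=30,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)

# resume_from_checkpoint=True picks up from the latest checkpoint in output_dir;
# on a fresh run (no checkpoints yet) start from scratch instead.
has_ckpt = os.path.isdir("out") and any(
    d.startswith("checkpoint-") for d in os.listdir("out")
)
trainer.train(resume_from_checkpoint=has_ckpt)
```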


Hi everyone, I’m new here. I was looking for a solution to this problem and came across this topic; I hope someone can help solve it. Good luck to everyone.


It’s probably because the learning rate decreases over the epochs: when you restart, it is back at its maximum value, which causes the drop in the loss. This happens in many frameworks. I’m not sure about your settings, though.
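
One way to check that hypothesis: by default, each Trainer checkpoint contains a trainer_state.json whose log_history records the learning rate at every logging step, so you can compare the value logged just before the interruption with the first value after resuming. The out/ path below is just a placeholder for your output_dir:

```python
import json
from pathlib import Path

# Pick the latest checkpoint directory, e.g. out/checkpoint-5000.
latest = max(
    Path("out").glob("checkpoint-*"),
    key=lambda p: int(p.name.split("-")[-1]),
)

state = json.loads((latest / "trainer_state.json").read_text())

# Each training log entry carries the learning rate at that step.
for entry in state["log_history"]:
    if "learning_rate" in entry:
        print(entry["step"], entry["learning_rate"])
```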

I don’t think that’s the problem. Look at the learning rate decay: it seems to go to 0 linearly.
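
To illustrate with a toy example: a linear schedule decays to 0, and if the scheduler state is saved and reloaded (which is, as far as I know, what the Trainer does via scheduler.pt when resuming), the decay continues where it left off instead of jumping back to the peak learning rate. The numbers below are arbitrary, not from the run above:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Toy optimizer/scheduler just to show the shape of the schedule.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=5e-5)
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=100)

for step in range(60):
    opt.step()
    sched.step()

print(sched.get_last_lr()[0])  # ~2e-5: decayed linearly from 5e-5 toward 0

# Restoring the scheduler state resumes the decay rather than resetting it.
saved = sched.state_dict()
resumed = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=100)
resumed.load_state_dict(saved)
print(resumed.get_last_lr()[0])  # same ~2e-5, not the initial 5e-5
```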