Hello! I’ve been training roberta-base for many hours with the Hugging Face Trainer. Because of issues with the GPUs I’m using, I sometimes have to stop training manually and then restart it with trainer.train(resume_from_checkpoint=True). The problem is that every time I restart the training, I can see the loss jumping down on both the training and validation sets. See my wandb plot:
Has anyone had the same problem, or do you know how to fix it?
Hi everyone, I’m new here. I was looking for a solution to this problem and came across this topic; I hope someone can help solve it. Good luck to everyone.
It’s probably because the learning rate decreases over the course of training. If the restart puts the schedule back at its maximum value, that would explain the jump in loss. This happens in many frameworks; I’m not sure about your settings, though.
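To make the idea above concrete, here is a toy sketch (plain Python, not the Trainer's actual scheduler code) of a linear-decay learning-rate schedule. If the scheduler state isn't restored on resume and the step counter resets to zero, the learning rate snaps back to its peak, which can visibly change the loss curve. The function name and numbers below are illustrative assumptions, not anything from transformers:

```python
# Toy illustration: a linear-decay LR schedule, simplified.
# If a restart re-creates the scheduler instead of restoring its state,
# the step counter resets to 0 and the LR jumps back to its maximum.

def linear_decay_lr(step, total_steps, peak_lr=5e-5):
    """Linearly decay the learning rate from peak_lr at step 0 to 0 at total_steps."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

total_steps = 10_000

# Uninterrupted run: at step 6000 the LR is well below the peak.
lr_continued = linear_decay_lr(6000, total_steps)   # 2e-05

# Interrupted at step 6000, then resumed with a *fresh* scheduler:
# the schedule starts over from step 0, i.e. back at the peak LR.
lr_restarted = linear_decay_lr(0, total_steps)      # 5e-05

print(lr_continued, lr_restarted)
```

In practice, trainer.train(resume_from_checkpoint=True) is supposed to restore the optimizer and scheduler state from the checkpoint, so it's worth checking that the checkpoint directory actually contains the optimizer/scheduler files and that the resumed run's LR curve in wandb continues from where it left off rather than resetting.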