Let’s say I’ve loaded a model with from_pretrained() and fine-tuned it for 40 epochs. Looking at the resulting plots, I can see there’s still some room for improvement, so perhaps I could train it for a few more epochs.
I realize that in order to continue training, I have to call trainer.train(resume_from_checkpoint=path_to_checkpoint). However, I don’t know how to specify the number of additional epochs I want it to train for, since it has already finished the 40 epochs I initially instructed it to train for.
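For context, the call I’m referring to looks roughly like this (the checkpoint path is just an example):

```python
# `trainer` is the Trainer I already built; the checkpoint path is hypothetical.
# resume_from_checkpoint restores the weights plus the optimizer/scheduler state.
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```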
Do I have to define a new Trainer? And if I define a new Trainer, can I also change the learning rate? On top of these questions, there is also the learning rate scheduler. The Trainer’s default is OneCycleLR, if I’m not mistaken, which means that by the end of my 40 previous epochs the learning rate was 0. If I restart the training process, will the whole scheduler restart as well?
Yes, you will need to start a new training run with new training arguments, since you are not resuming from a checkpoint.

The Trainer uses a linear decay by default, not the 1cycle policy, so your learning rate did end up at 0 at the end of the first training, and it will restart at the value you set in your new training arguments.
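A minimal sketch of what that looks like (every name and value below is a placeholder):

```python
from transformers import Trainer, TrainingArguments

# Fresh arguments for the new run; all values here are placeholders.
new_args = TrainingArguments(
    output_dir="finetune-round-2",
    num_train_epochs=10,       # the additional epochs you want
    learning_rate=1e-5,        # the scheduler restarts from this value
)

# model / train_dataset are whatever you already have in memory or reloaded.
trainer = Trainer(
    model=model,
    args=new_args,
    train_dataset=train_dataset,
)
trainer.train()  # no checkpoint passed, so the linear decay starts over
```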
Ah, so it’s more like restarting the training from a checkpoint rather than continuing exactly from where you left off, at least as far as the learning rate is concerned. I suppose I could continue the training by setting a very low learning rate, to approximate the values the scheduler would have produced had training continued normally past epoch 40. Regarding the scheduler, you are right that it is a linear decay, but I think it also has optional warm-up steps, in which case it resembles OneCycleLR.
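If I read the docs correctly, the default schedule is equivalent to something like this standalone version (a sketch; the optimizer and step count are assumed to already exist):

```python
from transformers import get_linear_schedule_with_warmup

# My reading of the Trainer's default schedule: linear warm-up for
# num_warmup_steps, then linear decay of the learning rate down to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer,                 # the optimizer built for training
    num_warmup_steps=0,        # Trainer default: no warm-up
    num_training_steps=total_training_steps,
)
```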
Also, regarding the output_dir argument of the new TrainingArguments object I will define to restart training: can I pass any path I want there? I could use a different path from the previous one, or reuse the old one. If I reuse the old one, is it usual to set overwrite_output_dir=True so that it overwrites the old checkpoint?
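Something like this, I assume (the directory name is made up):

```python
from transformers import TrainingArguments

# Reusing the old output directory; overwrite_output_dir=True tells the
# Trainer it may overwrite what is already there.
args = TrainingArguments(
    output_dir="my_old_run",
    overwrite_output_dir=True,
)
```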
By the way, @sgugger, in my case where I don’t actually want to continue training, because the 40 epochs are completed, do I still pass a checkpoint path to the trainer, or do I just call trainer.train()?
@ThomasG
Hello, I was facing exactly the same issue and found this topic.

In fact, with even the same number of epochs. After the 40 epochs (and 1 month; it was a big model), the learning rate reached 0 (after the warm-up and linear decay), but the loss looked like it could continue to fall.

Let me ask you: did you restart training using the same learning rate as the first training? Did you test other learning rate values? How did they behave? How did the change in learning rate from starting a new run affect the loss curve? (Did it continue the same decay as the previous training?)

And at the end of the second training, did your loss curve still show that it could improve further? If so, did you train again?
I just tried to simulate the behavior of the scheduler, as if it had continued training after 40 epochs. I think I set the initial LR to a value similar to the scheduler’s final ones, and I did not include a warm-up since I only wanted it to decrease. This is in no way a robust solution; it was just a thought on how to approximate what it would do.
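Roughly, the arithmetic looked like this (all numbers are invented for illustration):

```python
# Pretend the original run had been scheduled for more epochs, and read off
# the LR a linear decay would have reached after the 40 epochs already done.
peak_lr = 5e-5        # LR of the original run (after warm-up)
total_epochs = 50     # hypothetical longer schedule
done_epochs = 40      # epochs already trained

# Linear decay from peak_lr to 0 over total_epochs:
resume_lr = peak_lr * (1 - done_epochs / total_epochs)  # -> 1e-5 here

# Then use resume_lr as learning_rate in the new TrainingArguments,
# with warmup_steps=0 so the LR only decreases.
```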
The other questions you ask are very task-specific. Even if my loss curve showed signs that it could improve after the second training ended, that does not mean yours would behave similarly. Good luck with your project!