Continue fine-tuning with Trainer() after completing the initial training process

Hey all,

Let’s say I’ve loaded a model with from_pretrained() and fine-tuned it for 40 epochs. Looking at the resulting plots, I can see there’s still some room for improvement, so perhaps I could train it for a few more epochs.

I realize that in order to continue training, I have to call trainer.train(path_to_checkpoint). However, I don’t know how to specify the number of additional epochs I want it to train for, since it has already finished the 40 epochs I initially instructed it to train for.

Do I have to define a new trainer? And if I define a new trainer, can I also change the learning rate? On top of these questions, there is also the learning rate scheduler: the Trainer’s default is OneCycleLR, if I’m not mistaken. This means that by the end of my previous 40 epochs, the learning rate was 0. If I restart the training process, will the whole scheduler restart as well?

Thanks for any help in advance.

Yes, you will need to start a new training run with new training arguments, since you are not resuming from a checkpoint.
The Trainer uses a linear decay by default, not the 1cycle policy, so your learning rate did end up at 0 at the end of the first training, and will restart at the value you set in your new training arguments.
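
For instance, a minimal sketch of such a restart (the model class, the paths, and all hyperparameter values here are only placeholders, and train_dataset stands for whatever dataset object you used in the first run):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the weights produced by the first 40-epoch run (placeholder path).
model = AutoModelForSequenceClassification.from_pretrained("runs/finetune-round1")

# Fresh arguments: the scheduler starts over from this learning rate,
# it does not pick up where the previous linear decay left off.
training_args = TrainingArguments(
    output_dir="runs/finetune-round2",
    num_train_epochs=10,   # the extra epochs you want
    learning_rate=5e-6,    # placeholder value; pick whatever suits your task
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your existing dataset object
)
trainer.train()  # no checkpoint argument: this is a new run, not a resume
```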

Hi, thanks for answering.

Ah, so it’s more like restarting the training from a checkpoint, not actually continuing entirely from where you left off, at least as far as the learning rate’s values are concerned. I suppose I could continue the training by setting a very low learning rate, to approximate the values it would have had if it had kept training normally past epoch 40. Regarding the scheduler, you are right, it is linear decay, but I think it also has optional warm-up steps, in which case it resembles OneCycleLR.

Also, regarding the output_dir argument of the new TrainingArguments object I will define to restart training: do I pass any path I want there? I could use a different path from the previous one, or the old one. If I use the old one, is it usual to set overwrite_output_dir=True so that it overwrites the old checkpoints?

Thanks in advance.

It depends on what you want, but you can re-use the same output_dir if you don’t mind overwriting your old checkpoints.
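
For instance, both options look roughly like this (the paths are placeholders):

```python
from transformers import TrainingArguments

# Option 1: a fresh directory, old checkpoints stay untouched.
args_new_dir = TrainingArguments(output_dir="runs/finetune-round2")

# Option 2: reuse the old directory and allow its contents to be overwritten.
args_same_dir = TrainingArguments(
    output_dir="runs/finetune-round1",
    overwrite_output_dir=True,  # old checkpoints in this directory may be replaced
)
```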

Alright, thank you. Have a nice day

By the way, @sgugger, in my case where I don’t actually want to resume training (since the 40 epochs are completed), do I still pass a checkpoint path to the trainer, or do I just call trainer.train()?

You should instantiate your model from the trained version, then launch trainer.train().
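
In other words, roughly the following (the checkpoint path is a placeholder; resume_from_checkpoint restores the old optimizer and scheduler state, while a plain call starts fresh):

```python
# Resuming an interrupted run: optimizer and scheduler state are restored.
trainer.train(resume_from_checkpoint="runs/finetune-round1/checkpoint-5000")

# Starting a new run from already fine-tuned weights (this thread’s case):
trainer.train()
```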

Great, I’ll just load it using from_pretrained() and train it with new TrainingArguments. Thank you 🙂

@ThomasG
Hello, I was facing exactly the same issue and found this topic.
In fact, it is even the same number of epochs. After the 40 epochs (and one month, it was a big model), the learning rate reached 0 (after the warm-up and linear decay), but it looks like the loss can keep falling.

Let me ask you: did you restart training using the same learning rate as in the first training? Did you test other learning rate values, and how did they behave? How did changing the learning rate when starting a new training run affect the loss curve (did it continue the same decay as the previous training)?
And at the end of the second training, did your loss curve still show that it could improve further? If so, did you train again?

Thanks in advance.

Hello.

I just tried to simulate the behavior of the scheduler as if it had continued training past 40 epochs. I think I set the initial LR to a value similar to the scheduler’s final ones, and I did not include a warm-up since I only wanted it to decrease. This is in no way a robust solution; it was just a thought experiment to approximate what the scheduler would have done.
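
Something along these lines would express that idea (all values here are made-up placeholders, not the ones I actually used):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="runs/finetune-round2",
    num_train_epochs=5,
    learning_rate=1e-6,          # small, near the tail of the first run’s decay
    warmup_steps=0,              # no warm-up: the rate only decreases
    lr_scheduler_type="linear",  # the default linear decay down to 0
)
```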

The other questions you ask are very task-specific. Even though my loss curve showed signs that it could improve after the second training ended, that does not mean yours would behave similarly. Good luck with your project!
