Let’s say I’ve loaded a model with from_pretrained() and fine-tuned it for 40 epochs. Looking at the resulting plots, I can see there’s still some room for improvement, so perhaps I could train it for a few more epochs.
I realize that in order to continue training, I have to call trainer.train(resume_from_checkpoint=path_to_checkpoint). However, I don’t know how to specify the number of additional epochs I want it to train for, since it has already finished the 40 epochs I initially instructed it to train for.
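For context, the call I’m referring to looks roughly like this (the checkpoint path is just an example):

```python
# `trainer` is the Trainer I already built; the checkpoint path is hypothetical.
# resume_from_checkpoint restores the weights plus the optimizer/scheduler state.
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```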
Do I have to define a new Trainer? And if I define a new Trainer, can I also change the learning rate? On top of these questions, there is also the learning rate scheduler. The Trainer’s default is OneCycleLR, if I’m not mistaken, which means that by the end of my 40 previous epochs the learning rate was 0. If I restart the training process, will the whole scheduler restart as well?
Yes, you will need to start a new training run with new training arguments, since you are not resuming from a checkpoint.

The Trainer uses a linear decay by default, not the 1cycle policy, so your learning rate did end up at 0 at the end of the first training, and it will restart at the value you set in your new training arguments.
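A minimal sketch of what that looks like (every name and value below is a placeholder):

```python
from transformers import Trainer, TrainingArguments

# Fresh arguments for the new run; all values here are placeholders.
new_args = TrainingArguments(
    output_dir="finetune-round-2",
    num_train_epochs=10,       # the additional epochs you want
    learning_rate=1e-5,        # the scheduler restarts from this value
)

# model / train_dataset are whatever you already have in memory or reloaded.
trainer = Trainer(
    model=model,
    args=new_args,
    train_dataset=train_dataset,
)
trainer.train()  # no checkpoint passed, so the linear decay starts over
```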
Ah, so it’s more like restarting the training from a checkpoint rather than continuing exactly from where you left off, at least as far as the learning rate is concerned. I suppose I could continue the training by setting a very low learning rate, to approximate the values the scheduler would have produced had training continued normally past epoch 40. Regarding the scheduler, you are right that it is a linear decay, but I think it also has optional warm-up steps, in which case it resembles OneCycleLR.
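If I read the docs correctly, the default schedule is equivalent to something like this standalone version (a sketch; the optimizer and step count are assumed to already exist):

```python
from transformers import get_linear_schedule_with_warmup

# My reading of the Trainer's default schedule: linear warm-up for
# num_warmup_steps, then linear decay of the learning rate down to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer,                 # the optimizer built for training
    num_warmup_steps=0,        # Trainer default: no warm-up
    num_training_steps=total_training_steps,
)
```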
Also, regarding the output_dir argument of the new TrainingArguments object I will define to restart training: can I pass any path I want there? I could use a different path from the previous one, or reuse the old one. If I reuse the old one, is it usual to set overwrite_output_dir=True so that it overwrites the old checkpoint?
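Something like this, I assume (the directory name is made up):

```python
from transformers import TrainingArguments

# Reusing the old output directory; overwrite_output_dir=True tells the
# Trainer it may overwrite what is already there.
args = TrainingArguments(
    output_dir="my_old_run",
    overwrite_output_dir=True,
)
```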
By the way, @sgugger, in my case where I don’t actually want to continue training, because the 40 epochs are completed, do I still pass a checkpoint path to the trainer, or do I just call trainer.train()?
@ThomasG
Hello, I was facing exactly the same issue and found this topic.

In fact, with even the same number of epochs. After the 40 epochs (and 1 month; it was a big model), the learning rate reached 0 (after the warm-up and linear decay), but the loss looked like it could continue to fall.

Let me ask you: did you restart training using the same learning rate as the first training? Did you test other learning rate values? How did they behave? How did the change in learning rate from starting a new run affect the loss curve? (Did it continue the same decay as the previous training?)

And at the end of the second training, did your loss curve still show that it could improve further? If so, did you train again?
I just tried to simulate the behavior of the scheduler, as if it had continued training after 40 epochs. I think I set the initial LR to a value similar to the scheduler’s final ones, and I did not include a warm-up since I only wanted it to decrease. This is in no way a robust solution; it was just a thought on how to approximate what it would do.
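Roughly, the arithmetic looked like this (all numbers are invented for illustration):

```python
# Pretend the original run had been scheduled for more epochs, and read off
# the LR a linear decay would have reached after the 40 epochs already done.
peak_lr = 5e-5        # LR of the original run (after warm-up)
total_epochs = 50     # hypothetical longer schedule
done_epochs = 40      # epochs already trained

# Linear decay from peak_lr to 0 over total_epochs:
resume_lr = peak_lr * (1 - done_epochs / total_epochs)  # -> 1e-5 here

# Then use resume_lr as learning_rate in the new TrainingArguments,
# with warmup_steps=0 so the LR only decreases.
```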
The other questions you ask are very task-specific. Even if my loss curve showed signs that it could improve after the second training ended, that does not mean yours would behave similarly. Good luck with your project!