How to check or manually control the learning rate used in training?

jonathanalis · October 17, 2021, 3:38am

Hello,

I want to continue training a pretrained model. The model was trained until some point but took too long to run (8h per epoch) and it has to be finished. But we realized that the loss curve is likely to keep decreasing, so we decided to keep training from the last saved checkpoint.

We are using the AutoModelForMaskedLM model, with an initial learning rate of 1e-4 and lr_scheduler_type='linear'.
It seems that the learning rate decreases over the epochs (right? I can't find in the tutorials or in the documentation the exact equation used to set the learning rate over the epochs).

The losses for the last epochs of the loaded model were slowly decreasing below 0.55, and reached 0.546 when the model was saved.
However, when I resumed training, the loss went up to 0.6 after the first training epoch. I empirically tested a learning rate of 1e-6 and the loss went to 0.5454, an expected value.

So, I want to know if it is possible to get the learning rate values for each epoch at which the model was saved (are they saved anywhere in the checkpoint files?). Or at least to log/print the learning rate in each training epoch. How can I do that?
Is the learning rate restarting, so that I am losing all the progress the linear learning rate scheduler had made?

Also, what should I do to continue training with exactly the same learning rate as if the original training had never stopped?
It seems that I have to pass the return value of the function
transformers.get_linear_schedule_with_warmup()
to something in the trainer, but in order to do so, I need to get the optimizer from the trainer (another thing I don’t know how to do). Any ideas on how to do that?

Lastly, suppose I want to set a learning rate for each epoch, how to communicate it to the optimizer/trainer to use in trainer.train()? In other words, how to manually set the learning rate?

Thank you.

The code to load the model is something like this (I do not show all the training arguments here, only those relevant to the question):

from transformers import Trainer, TrainingArguments, AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
training_args = TrainingArguments(
    # ... other arguments omitted ...
    learning_rate=1e-4,
    lr_scheduler_type='linear',
    warmup_steps=0,
    warmup_ratio=0.1,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    tokenizer=tokenizer,
)

# Trainer.train() takes resume_from_checkpoint (there is no from_pretrained argument);
# it also restores the optimizer and scheduler state saved in the checkpoint.
trainer.train(resume_from_checkpoint=model_checkpoint)

deathcrush · May 6, 2022, 7:34am

My feeling here is that the Trainer saves the scheduler and optimizer state, and that when training restarts from a given checkpoint it should continue the learning rate decay from where it left off (since the learning rate is part of the optimizer state, and its annealing depends on the scheduler state, which should also be loaded).
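
For reference, here is a plain-Python sketch of the learning rate that a linear schedule with warmup produces at each optimizer step. It reflects my understanding of what transformers.get_linear_schedule_with_warmup computes; the function name linear_lr and the step counts in the example are made up for illustration.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay: go linearly from base_lr down to 0 at num_training_steps.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


# Hypothetical run: 10,000 steps in total, 1,000 of them warmup, base LR 1e-4.
total, warmup, base = 10_000, 1_000, 1e-4

# A schedule that starts over from step 0 is back in warmup at step 500 ...
lr_restarted = linear_lr(500, base, warmup, total)
# ... while a schedule resumed at step 8,000 is already late in the decay.
lr_resumed = linear_lr(8_000, base, warmup, total)
print(lr_restarted, lr_resumed)
```

If the scheduler state is not restored, the step counter starts again from 0 and the learning rate climbs back towards the initial value, which would be consistent with the loss jumping up after resuming.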

Regarding manually setting the learning rate - see the PyTorch documentation on learning rate schedulers. I think the LambdaLR scheduler would do what you want if parametrized appropriately. You can have your lr_lambda function return the right value for each epoch; note that LambdaLR multiplies the initial learning rate by that value, so it should be a factor rather than the learning rate itself. Be careful: the last_epoch property of this class increments every time .step() is called, so you need to know how many steps there are in an epoch to return the correct value.
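
A minimal sketch of that idea, assuming a hypothetical list of per-epoch learning rates and a known number of optimizer steps per epoch. Since LambdaLR scales the optimizer's initial learning rate by whatever lr_lambda returns, the function below returns a ratio. The names make_epoch_lr_lambda, build_scheduler, epoch_lrs and steps_per_epoch are made up for this example.

```python
def make_epoch_lr_lambda(epoch_lrs, base_lr, steps_per_epoch):
    """Build an lr_lambda for torch.optim.lr_scheduler.LambdaLR.

    LambdaLR calls lr_lambda with the number of scheduler steps taken so far
    and multiplies the optimizer's initial learning rate by the result.
    """
    def lr_lambda(step):
        # Convert the step counter into an epoch index, clamped to the last entry.
        epoch = min(step // steps_per_epoch, len(epoch_lrs) - 1)
        return epoch_lrs[epoch] / base_lr
    return lr_lambda


def build_scheduler(optimizer, epoch_lrs, base_lr, steps_per_epoch):
    """Wrap the lambda in a LambdaLR scheduler (requires PyTorch).

    Assumes the optimizer was created with lr=base_lr.
    """
    from torch.optim.lr_scheduler import LambdaLR
    return LambdaLR(optimizer, make_epoch_lr_lambda(epoch_lrs, base_lr, steps_per_epoch))
```

As far as I know, an optimizer and scheduler built this way can be handed to the Trainer through its optimizers=(optimizer, scheduler) argument, so that trainer.train() uses them instead of creating its own.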

Regarding logging - it should be possible to write a callback that logs the learning rate but I have not tried this. Will update if I have time.
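
A minimal sketch of such a callback, not tested against a real training run. It relies on the Trainer passing its optimizer to callback events as a keyword argument; the names current_lrs, make_lr_logging_callback and LearningRateLogger are made up for this example.

```python
def current_lrs(optimizer):
    """Return the learning rate of every parameter group of a PyTorch optimizer."""
    return [group["lr"] for group in optimizer.param_groups]


def make_lr_logging_callback():
    """Create a Trainer callback that prints the learning rate after each epoch."""
    # Imported here so that current_lrs can be used without transformers installed.
    from transformers import TrainerCallback

    class LearningRateLogger(TrainerCallback):
        def on_epoch_end(self, args, state, control, optimizer=None, **kwargs):
            # Skip silently if no optimizer was passed to the event.
            if optimizer is not None:
                print(f"epoch {state.epoch}: lr = {current_lrs(optimizer)}")

    return LearningRateLogger()
```

It would be registered with Trainer(..., callbacks=[make_lr_logging_callback()]). The Trainer's regular training logs also include a learning_rate entry at each logging step, and that log history is written to trainer_state.json inside each checkpoint folder, so it may be enough to read it from there.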

Hope this helps!
