Trainer optimizer

Hi everyone,

In my code I instantiate a Trainer as follows:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

I don’t specify anything in the “optimizers” field as I’ve always used the default one (AdamW).
I tried to create an optimizer instance similar to the default one so I could try to change the learning rate (lr).
The code I used to simulate the default optimizer is the following:

from transformers import AdamW, get_linear_schedule_with_warmup

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=5e-05, eps=1e-08)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=750
)

and then pass optimizer and scheduler as a tuple to the optimizers field of the Trainer.

The problem is that with this optimizer definition I get different results from the default one (even though I thought they were identical). What should I change to create an optimizer identical to the default one, but where I can set the lr directly from my code?

Thanks!

At first glance, it might be linked to the number of training steps. Are you sure your other training does 750 steps? Also, I don’t know what your training_args are, but if any of them don’t use the default value, that could also explain the change in results.

Yes, num_training_steps is definitely 750. The training_args are the default transformers TrainingArguments, as defined at this link.

The code is:

from transformers import HfArgumentParser, Trainer, TrainingArguments

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

P.S. ModelArguments contains the arguments pertaining to which model/config/tokenizer we are going to fine-tune from, while DataTrainingArguments contains the arguments pertaining to what data we are going to use for training and eval.

@sgugger One thing I noticed while doing other experiments is that with the optimizer as I have implemented it, after some epochs the learning rate reaches 0.0, while with the default one it always stays > 0.

I don’t know if this information can be used to understand how my implementation differs from the default one. They look really identical to me … I have no idea…

Mmm, it really does point to the number of training steps. When training with Trainer, it should print you that number (and it’s also the upper bound of the progress bar) so you can double-check.

@sgugger The information that the trainer gives me before training are these:

INFO - transformers.trainer -   ***** Running training *****
INFO - transformers.trainer -     Num examples = 13121
INFO - transformers.trainer -     Num Epochs = 4
INFO - transformers.trainer -     Instantaneous batch size per device = 8
INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 16
INFO - transformers.trainer -     Gradient Accumulation steps = 1
INFO - transformers.trainer -     Total optimization steps = 3284
Epoch:   0% 0/4 [00:00<?, ?it/s]
Iteration:   0% 0/821 [00:00<?, ?it/s]
Iteration:   0% 1/821 [00:00<08:07,  1.68it/s]

After training, among the results I find the various checkpoints, which advance in increments of 750:

No, your number of optimization steps is 3284. 750 is just the number of steps between two checkpoints.


Oh my god, thank you very much!

But what is the calculation to find that value directly from the code? Because in the code the scheduler and optimizer are instantiated first, and only then the Trainer.
How do I get that number of steps before instantiating the Trainer, so I can pass it as a parameter of get_linear_schedule_with_warmup?

This is the number of epochs you want to train, multiplied by the length of your training dataloader, then divided by the number of gradient accumulation steps.
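That calculation can be checked against the log above. A minimal sketch, plugging in the numbers the Trainer printed (13121 examples, total train batch size 16, 4 epochs, no accumulation):

```python
import math

# Reproduce the Trainer's "Total optimization steps" from the log above.
num_examples = 13121
total_train_batch_size = 16       # per-device batch 8, two devices
gradient_accumulation_steps = 1
num_epochs = 4

# Length of the training dataloader: one iteration per batch (last one partial).
steps_per_epoch = math.ceil(num_examples / total_train_batch_size)   # 821

num_training_steps = num_epochs * (steps_per_epoch // gradient_accumulation_steps)
print(num_training_steps)  # 3284, matching "Total optimization steps = 3284"
```

This also explains the progress bars in the log: 821 iterations per epoch, 4 epochs.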
The best way to use a custom optimizer/scheduler is to subclass Trainer and override the method create_optimizer_and_scheduler, since that method receives the number of training steps as an argument.
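A sketch of that override, assuming the create_optimizer_and_scheduler(self, num_training_steps) signature discussed in this thread (newer transformers versions split this into create_optimizer and create_scheduler, so check your version). The lr value here is just a placeholder for whatever custom value you want:

```python
import torch
from transformers import Trainer
from transformers.optimization import get_linear_schedule_with_warmup


class MyTrainer(Trainer):
    # Override so the scheduler is built with the real step count that the
    # Trainer computes (3284 in this thread), instead of a hard-coded 750.
    def create_optimizer_and_scheduler(self, num_training_steps: int):
        no_decay = ["bias", "LayerNorm.weight"]
        grouped = [
            {
                "params": [p for n, p in self.model.named_parameters()
                           if not any(nd in n for nd in no_decay)],
                "weight_decay": self.args.weight_decay,
            },
            {
                "params": [p for n, p in self.model.named_parameters()
                           if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        # Custom learning rate goes here; 2e-5 is an arbitrary example value.
        self.optimizer = torch.optim.AdamW(grouped, lr=2e-5, eps=1e-8)
        self.lr_scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=0,
            num_training_steps=num_training_steps,
        )
```

You then instantiate MyTrainer exactly like Trainer, without passing anything to optimizers.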


One more thing to notice is that you’re passing a weight_decay value of 0.0 for both parameter groups in your optimizer. This is harmless (and kinda pointless) when using the default value (which is also 0.0), but otherwise it can certainly lead to different results.
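To make the point concrete: with weight_decay 0.0 in both groups the split is a no-op, and it only starts to matter once the first group gets a non-zero value. A minimal sketch with torch.optim.AdamW and a toy module (attribute names chosen so the usual no_decay name matching actually fires):

```python
import torch


# Toy module mimicking transformer parameter naming ("LayerNorm.weight",
# "linear.bias", ...), purely for illustration.
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.LayerNorm = torch.nn.LayerNorm(4)


model = Toy()
no_decay = ["bias", "LayerNorm.weight"]
grouped = [
    {   # parameters that should be decayed (here: linear.weight only)
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,  # non-default value: now the two groups differ
    },
    {   # biases and LayerNorm weights: no decay
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(grouped, lr=5e-5, eps=1e-8)
print([g["weight_decay"] for g in optimizer.param_groups])  # [0.01, 0.0]
```

With both groups at 0.0, as in the snippet earlier in the thread, the grouping has no effect at all.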


Yes, I had noticed this too, but as you rightly said, it is the default value. In any case I will try to modify it to see if it brings any benefit to the model.

Thank you!

Hi, say I want to override the optimizer-creation method of Trainer to add a new parameter. Then I also need to change the method that calls it, which is create_optimizer_and_scheduler(), and then adjust the behavior of the whole flow step by step. This doesn’t seem like an ideal way to do it; is there a more elegant approach?