Trainer optimizer

Hi everyone,

In my code I instantiate a Trainer as follows:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

I don’t specify anything in the “optimizers” field as I’ve always used the default one (AdamW).
I tried to create an optimizer instance similar to the default one so I could try to change the learning rate (lr).
The code I used to simulate the default optimizer is the following:

from transformers import AdamW, get_linear_schedule_with_warmup

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=5e-05, eps=1e-08)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=750
)

and then pass optimizer and scheduler as a tuple to the optimizers field of the Trainer.

The problem is that with this optimizer definition I get different results from the default one (even though I thought they were identical). What should I change to create an optimizer identical to the default one, but where I can set the lr directly from my code?

Thanks!

At first glance, it might be linked to the number of training steps. Are you sure your other training does 750 steps? Also, I don’t know what your training_args are, but if any of them don’t use the default value, that could also explain the change in results.

Yes, num_training_steps is definitely 750. The training_args are the default transformers TrainingArguments, as defined at this link.

The code is:

from transformers import HfArgumentParser, Trainer, TrainingArguments

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

P.S. ModelArguments contains the arguments pertaining to which model/config/tokenizer we are going to fine-tune from, while DataTrainingArguments contains the arguments pertaining to what data we are going to use for training and eval.

@sgugger One thing I noticed while doing other experiments is that with the optimizer as I have implemented it, after some epochs the learning rate reaches 0.0, while with the default one it always stays > 0.

I don’t know if this information can be used to understand how my implementation differs from the default one. They look really identical to me … I have no idea…

Mmm, it really does point to the number of training steps. When training with Trainer, it should print you that number (and it’s also the upper bound of the progress bar) so you can double-check.

@sgugger The information that the trainer gives me before training are these:

INFO - transformers.trainer -   ***** Running training *****
INFO - transformers.trainer -     Num examples = 13121
INFO - transformers.trainer -     Num Epochs = 4
INFO - transformers.trainer -     Instantaneous batch size per device = 8
INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 16
INFO - transformers.trainer -     Gradient Accumulation steps = 1
INFO - transformers.trainer -     Total optimization steps = 3284
Epoch:   0% 0/4 [00:00<?, ?it/s]
Iteration:   0% 0/821 [00:00<?, ?it/s]
Iteration:   0% 1/821 [00:00<08:07,  1.68it/s]

After training, among the results I find the various checkpoints, which advance in increments of 750:

No, your number of optimization steps is 3284. 750 is just the number of steps between two checkpoints.


Oh my god, thank you very much!

But what is the calculation to find that value directly from the code? Because in the code the scheduler and optimizer are instantiated first, and only then the Trainer.
How do I get that number of steps before instantiating the Trainer, so I can pass it as a parameter of get_linear_schedule_with_warmup?

This is the number of epochs you want to train, multiplied by the length of your training dataloader, then divided by the number of gradient accumulation steps.
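That calculation can be checked against the log above. A minimal sketch, plugging in the numbers the Trainer printed (13121 examples, total train batch size 16, 4 epochs, no accumulation):

```python
import math

# Reproduce the Trainer's "Total optimization steps" from the log above.
num_examples = 13121
total_train_batch_size = 16       # per-device batch 8, two devices
gradient_accumulation_steps = 1
num_epochs = 4

# Length of the training dataloader: one iteration per batch (last one partial).
steps_per_epoch = math.ceil(num_examples / total_train_batch_size)   # 821

num_training_steps = num_epochs * (steps_per_epoch // gradient_accumulation_steps)
print(num_training_steps)  # 3284, matching "Total optimization steps = 3284"
```

This also explains the progress bars in the log: 821 iterations per epoch, 4 epochs.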
The best way to use a custom optimizer/scheduler is to subclass Trainer and override the method create_optimizer_and_scheduler, since that method receives the number of training steps as an argument.
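A sketch of that override, assuming the create_optimizer_and_scheduler(self, num_training_steps) signature discussed in this thread (newer transformers versions split this into create_optimizer and create_scheduler, so check your version). The lr value here is just a placeholder for whatever custom value you want:

```python
import torch
from transformers import Trainer
from transformers.optimization import get_linear_schedule_with_warmup


class MyTrainer(Trainer):
    # Override so the scheduler is built with the real step count that the
    # Trainer computes (3284 in this thread), instead of a hard-coded 750.
    def create_optimizer_and_scheduler(self, num_training_steps: int):
        no_decay = ["bias", "LayerNorm.weight"]
        grouped = [
            {
                "params": [p for n, p in self.model.named_parameters()
                           if not any(nd in n for nd in no_decay)],
                "weight_decay": self.args.weight_decay,
            },
            {
                "params": [p for n, p in self.model.named_parameters()
                           if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        # Custom learning rate goes here; 2e-5 is an arbitrary example value.
        self.optimizer = torch.optim.AdamW(grouped, lr=2e-5, eps=1e-8)
        self.lr_scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=0,
            num_training_steps=num_training_steps,
        )
```

You then instantiate MyTrainer exactly like Trainer, without passing anything to optimizers.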


One more thing to notice is that you’re passing a weight_decay value of 0.0 for both parameter groups in your optimizer. This is harmless (and kinda pointless) when using the default value (which is also 0.0), but otherwise it can certainly lead to different results.
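To make the point concrete: with weight_decay 0.0 in both groups the split is a no-op, and it only starts to matter once the first group gets a non-zero value. A minimal sketch with torch.optim.AdamW and a toy module (attribute names chosen so the usual no_decay name matching actually fires):

```python
import torch


# Toy module mimicking transformer parameter naming ("LayerNorm.weight",
# "linear.bias", ...), purely for illustration.
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.LayerNorm = torch.nn.LayerNorm(4)


model = Toy()
no_decay = ["bias", "LayerNorm.weight"]
grouped = [
    {   # parameters that should be decayed (here: linear.weight only)
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,  # non-default value: now the two groups differ
    },
    {   # biases and LayerNorm weights: no decay
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(grouped, lr=5e-5, eps=1e-8)
print([g["weight_decay"] for g in optimizer.param_groups])  # [0.01, 0.0]
```

With both groups at 0.0, as in the snippet earlier in the thread, the grouping has no effect at all.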


Yes, I had noticed this too, but as you rightly said, it is the default value. In any case I will try to modify it to see if it brings any benefit to the model.

Thank you!

Hi, say I want to override the optimizer-creation method of Trainer to add a new parameter. Then I also need to change the method that calls it, which is create_optimizer_and_scheduler(), and then adjust the behavior of the whole flow step by step. This doesn’t seem like an ideal way to do it; is there a more elegant approach?