Trainer Ignoring Weight Decay, Beta arguments

Hello, this is probably an easy one to answer.

Trainer seems to be ignoring my weight decay and adam_beta arguments. I will be training a model for many epochs, each consisting of many steps, and I want to slow down the rate at which the learning rate falls off so I don’t end up with a virtually zero learning rate after a few epochs.

Specifying weight_decay=0 and increasing adam_beta1 and adam_beta2 does not seem to do anything to the magnitude of the learning rate decay.

Am I missing something here? Or is this intended behavior? The weight decay also seems to be altered on a per-step basis. I don’t see any arguments that allow me to change the “weight decay strategy”. Is the only way to alter that by subclassing Trainer and overriding the scheduler?


It turns out it wasn’t ignoring my arguments, but that I was trying to solve the wrong problem.

I needed to leave the Adam arguments alone. Even with zero weight decay, there is a learning rate scheduler that defaults to linear decay. You need to subclass Trainer and create a custom scheduler, as seen in other forum posts, and you can set power to a lower value for a slower decrease in the learning rate:

import torch
from transformers import Trainer, get_polynomial_decay_schedule_with_warmup

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def create_optimizer_and_scheduler(self, num_training_steps):
        # Build the optimizer from the values in TrainingArguments
        self.optimizer = torch.optim.AdamW(self.model.parameters(),
                               lr=self.args.learning_rate,
                               weight_decay=self.args.weight_decay)
        # Swap the default linear schedule for polynomial decay;
        # power < 1 makes the learning rate fall off more slowly
        self.lr_scheduler = get_polynomial_decay_schedule_with_warmup(
            self.optimizer, num_warmup_steps=0,
            num_training_steps=num_training_steps, power=0.5)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=input_ds['train'],
    eval_dataset=input_ds['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)