For the Seq2SeqTrainingArguments class, what happens when I set both adafactor=True and a learning rate?

Say that I have the following Seq2SeqTrainingArguments class:

Seq2SeqTrainingArguments(
    adafactor = True,
    optim = "adafactor",
    learning_rate = 1e-4
)

In this case, I am not sure if the learning_rate is actually used anywhere. From the Seq2SeqTrainingArguments documentation:

  • learning_rate (float, optional, defaults to 5e-5) — The initial learning rate for AdamW optimizer.

Does this mean that it is completely ignored for Adafactor?

Thank you!

No, it’s not ignored. Adafactor will use that as an initial “external” lr. I’ve found that Adafactor works best without a learning rate set, though, as it does a pretty good job of adjusting it internally. With the standalone Adafactor optimizer from transformers, that looks like this:
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,  # let Adafactor manage the learning rate itself
)
You’ll need a scheduler too (AdafactorSchedule, shown further down). You also need to make sure you have a warmup period so that Adafactor can adjust its learning rate before training; 5-10% of the total training steps is a reasonable range.
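
If you instead stay with the Trainer-built optimizer from your original question (optim="adafactor"), the warmup can be expressed through the usual training arguments. A minimal sketch, assuming the Trainer builds the optimizer and scheduler itself; output_dir and the exact values are placeholders:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="out",      # placeholder
    optim="adafactor",
    learning_rate=1e-4,    # passed to Adafactor when the Trainer builds it
    warmup_ratio=0.1,      # roughly 10% of total steps spent warming up
)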

It’s not as fast as AdamW, but Adafactor provides superior results with less overhead, in my experience training Whisper.

Or use this, with an explicit lr, if you wish:

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.05,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)  # If no lr, set the last three (relative_step, scale_parameter, warmup_init) to True instead.

from transformers.optimization import AdafactorSchedule

# Adafactor performs its own scheduling, so this class creates a proxy object
# that retrieves the current lr values from the optimizer (e.g. for logging).
lr_scheduler = AdafactorSchedule(optimizer)
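
To actually use the manual optimizer/scheduler pair, you hand them to the trainer through the optimizers argument. A minimal sketch; model, training_args, train_ds and eval_ds are placeholders for objects you would already have:

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    optimizers=(optimizer, lr_scheduler),  # Trainer skips building its own optimizer/scheduler
)
trainer.train()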