No, it’s not ignored. Adafactor will use that value as an initial “external” lr. I’ve found that Adafactor works best without a learning rate set, though, as it does a pretty good job of adjusting it internally:
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
You’ll need a scheduler too.
You also need to make sure you have a warmup period, so that Adafactor can adjust its learning rate before training proper begins. Roughly 5-10% of your total training steps works well; see the sketch below.
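A minimal sketch of that calculation (the 5% ratio and the step count below are just illustrative assumptions, not values fixed by Adafactor):

def warmup_steps(total_steps: int, warmup_ratio: float = 0.05) -> int:
    # Reserve a fraction of the total training steps for warmup (5-10% suggested).
    return int(total_steps * warmup_ratio)

# e.g. 10,000 total training steps with a 5% warmup -> 500 warmup steps
print(warmup_steps(10_000, 0.05))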
It’s not as fast as AdamW, but from my experience training Whisper, Adafactor provides superior results with less overhead.
Or use this with an explicit lr if you wish:
from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.05,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)  # If you don't set an lr, set the last three options to True instead (as in the first example).

# Since Adafactor performs its own scheduling, this class creates a proxy object
# that retrieves the current lr values from the optimizer.
lr_scheduler = AdafactorSchedule(optimizer)
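If you’re fine-tuning Whisper with the Trainer API, you can hand both objects over through the optimizers argument. A minimal sketch, assuming you already have model, training_args, the datasets, and data_collator from your own Whisper fine-tuning setup (those names are placeholders here):

from transformers import Seq2SeqTrainer

# `model`, `training_args`, `train_dataset`, `eval_dataset` and `data_collator`
# are placeholders for your own Whisper fine-tuning setup.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    optimizers=(optimizer, lr_scheduler),  # use the Adafactor + AdafactorSchedule pair instead of the defaults
)
trainer.train()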