How is the AdafactorSchedule supposed to be used?

Hi,

I recently saw my transformer model having divergence issues, and I came across a paper that uses Adafactor, so I wanted to try it out. The docs are fantastic, but they don’t mention how often the Adafactor scheduler should be called or how it actually works. How is it supposed to be used? When do I call the scheduler in my code?

from transformers.optimization import Adafactor, AdafactorSchedule
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)

All my code is a custom training loop, so I don’t know when lr_scheduler is supposed to be called. Usually it depends on the model, or I start off with the convenient ReduceLROnPlateau.

When using lr=None with Trainer you will most likely need to use the AdafactorSchedule scheduler as follows:

from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
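
If you are writing your own training loop instead of using Trainer, a rough sketch is below (model, train_dataloader and compute_loss stand in for your own code, so treat the details as assumptions to verify against the source). With relative_step=True the learning rate is computed inside Adafactor on every optimizer.step(); AdafactorSchedule is only a reporting proxy, so stepping it just refreshes what get_last_lr() returns.

from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)

for batch in train_dataloader:          # placeholder for your dataloader
    loss = compute_loss(model, batch)   # placeholder for your forward pass / loss
    loss.backward()
    optimizer.step()                    # Adafactor advances its internal schedule here
    lr_scheduler.step()                 # does not alter the schedule; just refreshes get_last_lr()
    optimizer.zero_grad()
    current_lr = lr_scheduler.get_last_lr()  # the lr Adafactor actually used, for logging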



Apologies for the direct ping, but could you help me find the right person for this? @sgugger?

Thank you in advance!

I think @stas or @patrickvonplaten have more experience with Adafactor.

Note that it won’t stay in the library forever: merging it was spreading ourselves a little too thin into optimizer territory, and we now realize we don’t have the manpower to maintain it properly. So you should use a version from another library to be future-proof :slight_smile:


Well, I hacked together AdafactorSchedule since Adafactor uses an internal scheduler and provides no access to it. We needed the workaround to prevent HF Trainer from failing when it calls get_last_lr.

Later someone pointed out that my hack was incomplete since it only reported the schedule for a single param group if I remember correctly.
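
In case it helps, here is a rough sketch of a way to read the per-group values yourself instead of going through AdafactorSchedule. It relies on the private Adafactor._get_lr helper and on the optimizer state created by the first optimizer.step(), so both are assumptions that may break between versions.

def adafactor_lrs(optimizer):
    # Sketch: read the lr Adafactor computed internally for each param group.
    # Only meaningful after at least one optimizer.step(), which populates the
    # per-parameter state (step count, RMS) that the calculation needs.
    lrs = []
    for group in optimizer.param_groups:
        state = optimizer.state[group["params"][0]]
        if state:  # empty before the first step
            lrs.append(optimizer._get_lr(group, state))  # private helper: assumption
    return lrs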

As @sgugger mentioned, it’s best for you to seek out an external-to-transformers solution, since Adafactor is scheduled to be removed in transformers-v5. I think our original copy came from fairseq.
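
For example, the fairseq implementation should be a close drop-in; the import path and argument names below are my best guess, so double-check them against the fairseq source before relying on them.

from fairseq.optim.adafactor import Adafactor  # import path is an assumption

optimizer = Adafactor(
    model.parameters(),   # your model
    lr=None,              # let Adafactor derive the step size internally
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)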

Something is not clear to me: should the optimizer (Adafactor) be given to TrainingArguments or to the Trainer?


I have the same confusion.