My transformer model has recently been having divergence issues, and I came across a paper that uses Adafactor, so I wanted to try it out. The docs are fantastic, but they don’t mention how often the Adafactor scheduler should be called or how it actually works. How is it supposed to be used? When do I call the scheduler in my code?
All my training code is custom, so I don’t know when lr_scheduler is supposed to be called. Usually it depends on the model, or I start off with the convenient ReduceLROnPlateau.
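For example, when I use ReduceLROnPlateau, my loop is roughly shaped like this (simplified; `MyTransformer`, the loaders, and `evaluate` are my own pieces), with the scheduler stepping once per epoch on the validation loss:

```python
import torch

model = MyTransformer()  # placeholder for my model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    val_loss = evaluate(model, val_loader)  # my own eval helper
    scheduler.step(val_loss)  # ReduceLROnPlateau takes the metric, once per epoch
```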
When using lr=None with Trainer, you will most likely need to use the AdafactorSchedule scheduler, as follows:
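(A minimal sketch; `model`, `training_args`, and `train_dataset` here stand for whatever you already pass to Trainer.)

```python
from transformers import Trainer
from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,  # let Adafactor's internal schedule drive the learning rate
)
lr_scheduler = AdafactorSchedule(optimizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, lr_scheduler),  # Trainer calls .step() on both for you
)
trainer.train()
```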
Note that it won’t stay in the library forever: merging it was spreading ourselves a bit too thin in optimizer territory, and we now realize we don’t have the manpower to maintain it properly. So you should use a version from another library to be future-proof.
Well, I hacked together AdafactorSchedule since Adafactor uses an internal scheduler and provides no access to it. We needed the workaround to prevent HF Trainer from failing when it calls get_last_lr.
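The idea behind the workaround is just a thin proxy: a do-nothing scheduler whose get_lr() reads back whatever step size Adafactor computed internally, so Trainer’s logging has something to report. Roughly like this (a simplified sketch, not necessarily the exact code that landed in transformers; it assumes the optimizer exposes a `_get_lr(group, state)` helper the way transformers’ Adafactor does):

```python
from torch.optim.lr_scheduler import LambdaLR


class AdafactorScheduleSketch(LambdaLR):
    """Dummy scheduler: does no scheduling itself, just reports the lr an
    Adafactor-like optimizer computed internally for each param group."""

    def __init__(self, optimizer, initial_lr=0.0):
        # Give LambdaLR a well-defined base lr, since Adafactor is built with lr=None.
        for group in optimizer.param_groups:
            group.setdefault("initial_lr", initial_lr)
        # Identity factor; get_lr below bypasses the lambda-based schedule anyway.
        super().__init__(optimizer, lambda _step: 1.0)

    def get_lr(self):
        opt = self.optimizer
        # Read back the step size the optimizer derived for each param group that
        # has actually seen a gradient, instead of computing a schedule of our own.
        lrs = [
            opt._get_lr(group, opt.state[group["params"][0]])
            for group in opt.param_groups
            if group["params"][0].grad is not None
        ]
        return lrs or list(self.base_lrs)  # before any step, fall back to initial_lr
```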
Later someone pointed out that my hack was incomplete: if I remember correctly, it only reported the schedule for a single param group.
As @sgugger mentioned, it’s best for you to seek out an external-to-transformers solution, since Adafactor is scheduled to be removed in transformers v5. I think our original copy came from fairseq.
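For what it’s worth, if you are on a custom loop rather than Trainer, you can also skip the external scheduler entirely and let Adafactor’s internal relative-step schedule do the work; a wrapper like AdafactorSchedule mainly buys you something to log. A minimal sketch (using the transformers copy for illustration; `model` and `train_loader` are your own, and the same arguments should carry over to an equivalent Adafactor from another library):

```python
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=None,              # no fixed lr: relative_step derives it each update
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)

for batch in train_loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()      # lr is computed internally; no scheduler.step() needed
    optimizer.zero_grad()
```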