My transformer model has recently been having divergence issues, and I came across a paper that uses Adafactor, so I wanted to try it out. The docs are fantastic, but they don’t mention how often the Adafactor scheduler should be called or how it actually works. How is it supposed to be used? When do I call the scheduler in my code?
All my training code is custom, so I don’t know when lr_scheduler is supposed to be called. Usually it depends on the model, or I start off with the convenient ReduceLROnPlateau.
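For example, when I use ReduceLROnPlateau, my loop is roughly shaped like this (simplified; `MyTransformer`, the loaders, and `evaluate` are my own pieces), with the scheduler stepping once per epoch on the validation loss:

```python
import torch

model = MyTransformer()  # placeholder for my model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    val_loss = evaluate(model, val_loader)  # my own eval helper
    scheduler.step(val_loss)  # ReduceLROnPlateau takes the metric, once per epoch
```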
When using lr=None with Trainer, you will most likely need to use the AdafactorSchedule scheduler, as follows:
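(A minimal sketch; `model`, `training_args`, and `train_dataset` here stand for whatever you already pass to Trainer.)

```python
from transformers import Trainer
from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,  # let Adafactor's internal schedule drive the learning rate
)
lr_scheduler = AdafactorSchedule(optimizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, lr_scheduler),  # Trainer calls .step() on both for you
)
trainer.train()
```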
Note that it won’t stay in the library forever: merging it was spreading ourselves a bit too thin in optimizer territory, and we now realize we don’t have the manpower to maintain it properly. So you should use a version from another library to be future-proof.
Well, I hacked together AdafactorSchedule since Adafactor uses an internal scheduler and provides no access to it. We needed the workaround to prevent HF Trainer from failing when it calls get_last_lr.
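The idea behind the workaround is just a thin proxy: a do-nothing scheduler whose get_lr() reads back whatever step size Adafactor computed internally, so Trainer’s logging has something to report. Roughly like this (a simplified sketch, not necessarily the exact code that landed in transformers; it assumes the optimizer exposes a `_get_lr(group, state)` helper the way transformers’ Adafactor does):

```python
from torch.optim.lr_scheduler import LambdaLR


class AdafactorScheduleSketch(LambdaLR):
    """Dummy scheduler: does no scheduling itself, just reports the lr an
    Adafactor-like optimizer computed internally for each param group."""

    def __init__(self, optimizer, initial_lr=0.0):
        # Give LambdaLR a well-defined base lr, since Adafactor is built with lr=None.
        for group in optimizer.param_groups:
            group.setdefault("initial_lr", initial_lr)
        # Identity factor; get_lr below bypasses the lambda-based schedule anyway.
        super().__init__(optimizer, lambda _step: 1.0)

    def get_lr(self):
        opt = self.optimizer
        # Read back the step size the optimizer derived for each param group that
        # has actually seen a gradient, instead of computing a schedule of our own.
        lrs = [
            opt._get_lr(group, opt.state[group["params"][0]])
            for group in opt.param_groups
            if group["params"][0].grad is not None
        ]
        return lrs or list(self.base_lrs)  # before any step, fall back to initial_lr
```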
Later someone pointed out that my hack was incomplete: if I remember correctly, it only reported the schedule for a single param group.
As @sgugger mentioned, it’s best for you to seek out an external-to-transformers solution, since Adafactor is scheduled to be removed in transformers v5. I think our original copy came from fairseq.
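For what it’s worth, if you are on a custom loop rather than Trainer, you can also skip the external scheduler entirely and let Adafactor’s internal relative-step schedule do the work; a wrapper like AdafactorSchedule mainly buys you something to log. A minimal sketch (using the transformers copy for illustration; `model` and `train_loader` are your own, and the same arguments should carry over to an equivalent Adafactor from another library):

```python
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=None,              # no fixed lr: relative_step derives it each update
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)

for batch in train_loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()      # lr is computed internally; no scheduler.step() needed
    optimizer.zero_grad()
```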