Improvements with SWA

prajjwal1 · August 25, 2020, 10:01am

Has anyone tried SWA with transformers? It would be interesting to see how much gain it provides over plain AdamW. I’m thinking of adding it to the Trainer if the gains are significant, since PyTorch natively supports it.

sgugger · August 25, 2020, 10:33am

SWA is a wrapper around any optimizer, so you could try it with AdamW. Fair warning though: I have never been able to reproduce any results of the original paper, or find that it actually helped with anything, when using it in fastai (maybe that is why it took two years to land in PyTorch). It’s also supposed to work best with a cyclical schedule (where you average the weights at the end of each cycle), which is not used in transformers.
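For readers who want to see what "average the weights at the end of the cycles" means, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions, not the fastai or PyTorch implementation: plain gradient descent stands in for the wrapped optimizer, the schedule is a simple linear decay that restarts every cycle, and the names `cyclical_lr`, `swa_train`, and `quadratic_grad` are made up for this example.

```python
def cyclical_lr(step, cycle_len, lr_max, lr_min):
    """Linearly decay from lr_max to lr_min within each cycle, then restart."""
    pos = (step % cycle_len) / max(cycle_len - 1, 1)
    return lr_max - (lr_max - lr_min) * pos


def swa_train(w0, grad_fn, steps, cycle_len, lr_max=0.1, lr_min=0.01):
    """Run gradient descent and keep a running average of end-of-cycle weights.

    Returns (final weights, averaged weights, number of snapshots averaged).
    Plain gradient descent is an assumption here; any optimizer step could
    be used in its place.
    """
    w = list(w0)
    w_swa, n_avg = None, 0
    for step in range(steps):
        lr = cyclical_lr(step, cycle_len, lr_max, lr_min)
        grads = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, grads)]
        # Take a snapshot only when a cycle ends (learning rate at its minimum).
        if (step + 1) % cycle_len == 0:
            if w_swa is None:
                w_swa = list(w)
            else:
                # Running mean: new_avg = (old_avg * n + w) / (n + 1)
                w_swa = [(a * n_avg + wi) / (n_avg + 1) for a, wi in zip(w_swa, w)]
            n_avg += 1
    return w, w_swa, n_avg


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2 (hypothetical example loss)."""
    return list(w)


final_w, averaged_w, n_snapshots = swa_train(
    [1.0, -2.0], quadratic_grad, steps=20, cycle_len=5
)
```

The averaged weights, not the final ones, are what SWA would evaluate. On a real network with batch norm, the normalization statistics would also need to be recomputed for the averaged weights, which this toy sketch leaves out.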

prajjwal1 · August 25, 2020, 10:37am

Yes, I saw. It’s not exclusive to SGD. But the PR and the torch contrib implementation don’t mention this reproducibility issue. Many people have seen improvements, as per the blog. I wonder how their LR scheduling differs from cyclical LR (in what ways is it better?). By the way, why are cyclical schedulers not used with transformers?

sgugger · August 25, 2020, 10:42am

They have not been SOTA (or they take longer to get there) for about 2 years. It’s just that fastai and transformers are the only widespread places that know it. In the case of transformers, I know the GPT models just don’t train as well with another schedule. I don’t know if all the BERT models are just following that trend, but I’d guess they also need that kind of schedule to train properly.

hasansalimkanmaz · January 6, 2021, 8:40pm

I was planning to add SWA to transformers by working on a PR. After seeing this thread, I am much more skeptical about starting work on it. @sgugger, I have some questions for you. Any answer would be much appreciated.

You said that it works well with cyclical learning rates. When I look at the paper and also the image below, I see that a constant learning rate also achieves better results.
Did you experiment with SWA and a constant learning rate on transformers?

@prajjwal1 Do you have any other insights about SWA with transformers?

FYI: I also opened a feature request to get more insights from others.

hasansalimkanmaz · January 12, 2021, 6:06am

I also couldn’t produce better results with SWA. Thanks for your input, @sgugger.
