Improvements with SWA

Has anyone tried SWA with transformers? It would be interesting to see how much of a gain it provides over plain AdamW. I’m thinking of adding it to the Trainer if the gains are significant, since PyTorch now supports it natively.

SWA is a wrapper around any optimizer, so you could try it with AdamW. Fair warning though: I was never able to reproduce the results of the original paper, or to find that it actually helped with anything, when using it in fastai (maybe that’s why it took two years to land in PyTorch). It’s also supposed to work best with a cyclical schedule (averaging the weights at the end of each cycle), which is not used in transformers.
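To make the “wrapper around any optimizer” point concrete, here is a minimal sketch of SWA on top of AdamW using PyTorch’s built-in `torch.optim.swa_utils` (available since PyTorch 1.6). The model, the dummy training step, and the `swa_start`/`swa_lr` values are illustrative placeholders, not settings from this thread:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(10, 2)                  # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

swa_model = AveragedModel(model)                # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=5e-4)   # LR schedule for the SWA phase
swa_start = 5                                   # epoch at which averaging begins

for epoch in range(10):
    # stand-in for a real training epoch
    loss = model(torch.randn(4, 10)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)      # fold current weights into the average
        swa_scheduler.step()

# For models with BatchNorm you would also call
# torch.optim.swa_utils.update_bn(loader, swa_model) before evaluating swa_model.
```

At evaluation time you use `swa_model` instead of `model`; the underlying optimizer (here AdamW) is untouched, which is what makes SWA optimizer-agnostic.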

Yes, I saw. It’s not exclusive to SGD. But the PR and the torch contrib implementation don’t mention this reproducibility issue, and many people have reported improvements per the blog post. I wonder how their LR scheduling differs from cyclical LR (in what ways is it better?). By the way, why aren’t cyclical schedulers used with transformers?

They haven’t been SOTA (or they take longer to get there) for about two years now; it’s just that fastai is about the only widely used place that still uses them :slight_smile: In the case of transformers, I know the GPT models just don’t train as well with another schedule. I don’t know if all the BERT models are simply following that trend, but I’d guess they also need that kind of schedule to train properly.
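For readers comparing the two families of schedules mentioned here: below is a sketch of the linear warmup-then-decay schedule typically used with transformer models, next to a cyclical warm-restart schedule of the kind SWA was designed around. It uses only plain PyTorch schedulers; the step counts and learning rates are illustrative:

```python
import torch

# Typical transformer schedule: linear warmup, then linear decay to 0.
opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
warmup, total = 100, 1000

def linear_warmup_decay(step):
    # ramp up for `warmup` steps, then decay linearly toward 0 at `total`
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

transformer_sched = torch.optim.lr_scheduler.LambdaLR(opt, linear_warmup_decay)

# Cyclical alternative: cosine annealing with warm restarts every 250 steps.
# With SWA you would snapshot/average the weights at the end of each cycle.
opt2 = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
cyclic_sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt2, T_0=250)
```

The transformer schedule peaks once and decays monotonically, so there is no natural “end of cycle” at which to average weights, which is part of why SWA’s usual recipe doesn’t map cleanly onto it.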

I was planning to add SWA to transformers by working on a PR. After seeing this thread, I am much more skeptical about starting work on it. @sgugger, I have some questions for you; any answer will be much appreciated.

  1. You said that it works best with cyclical learning rates. When I look at the paper, and also at the image below, I see that a constant learning rate also achieves better results.
  2. Did you experiment with SWA and a constant learning rate on transformers?
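The constant-learning-rate variant asked about above can be sketched with PyTorch’s `SWALR`, which anneals the learning rate to a fixed `swa_lr` and then holds it constant while the weights are averaged. The model and all hyperparameter values here are illustrative assumptions, not from the paper:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(4, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
swa_model = AveragedModel(model)

# Anneal from the current lr (1e-3) down to a constant SWA lr of 1e-4
# over 3 epochs, then hold it flat for the rest of training.
sched = SWALR(opt, swa_lr=1e-4, anneal_epochs=3, anneal_strategy="linear")

for epoch in range(10):
    opt.step()                          # stand-in for a real training epoch
    sched.step()
    swa_model.update_parameters(model)  # average weights every epoch
```

This matches the paper’s constant-LR setting: once the anneal finishes, every averaged snapshot is collected at the same learning rate.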

@prajjwal1 Do you have any other insights about SWA with transformers?

FYI: I also opened a feature request to get more insights from others.


I also couldn’t obtain better results with SWA. Thanks for your input @sgugger