Does the default weight_decay of 0.0 in transformers.AdamW make sense?


I have a question regarding the AdamW optimizer default weight_decay value.

In the Docs we can clearly see that the AdamW optimizer sets the default weight decay to 0.0.

Given that the whole purpose of AdamW is to decouple weight decay regularization from the gradient update, my understanding is that AdamW and Adam should produce exactly the same results when both are used with weight_decay=0.0 (that is, without weight decay).
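To make the equivalence concrete, here is a minimal single-parameter sketch of the two update rules (bias correction omitted for brevity; this is not the actual transformers implementation, just the core of each algorithm):

```python
import math

def step(w, grad, m, v, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
         weight_decay=0.0, decoupled=False):
    """One optimizer step for a single scalar parameter.

    decoupled=False -> classic Adam: the L2 term is folded into the gradient.
    decoupled=True  -> AdamW: weight decay is applied directly to the weight.
    """
    if not decoupled:
        grad = grad + weight_decay * w          # Adam: decay enters the moments
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad ** 2
    w = w - lr * m / (math.sqrt(v) + eps)
    if decoupled:
        w = w - lr * weight_decay * w           # AdamW: decay applied separately
    return w, m, v

# With weight_decay=0.0 the two updates are identical:
adam  = step(1.0, 0.5, 0.0, 0.0, weight_decay=0.0, decoupled=False)
adamw = step(1.0, 0.5, 0.0, 0.0, weight_decay=0.0, decoupled=True)
assert adam == adamw

# With weight_decay > 0 they diverge:
assert step(1.0, 0.5, 0.0, 0.0, weight_decay=0.01, decoupled=False) != \
       step(1.0, 0.5, 0.0, 0.0, weight_decay=0.01, decoupled=True)
```

So the two optimizers only differ once weight_decay is non-zero, which is what prompts the question about the default.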

Therefore, wouldn’t it make more sense for AdamW to have a default weight decay > 0?

Thank you so much!


Even though I agree about the default value (it should probably be 0.01, as in the PyTorch implementation), it probably should not be changed without warning, because that would break backwards compatibility. The optimizer was also implemented in transformers before it was available in PyTorch itself.

I guess it is implemented this way because most of the time you decide at initialization which parameters should be decayed and which shouldn’t, e.g. by passing separate parameter groups to the optimizer.
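A sketch of that grouping pattern, as it appears throughout the transformers example scripts (the parameter names and values below are hypothetical placeholders; in real code you would iterate over a torch model's `named_parameters()` and pass the groups to `AdamW`):

```python
# Hypothetical (name, parameter) pairs standing in for model.named_parameters().
named_parameters = [
    ("encoder.layer.0.attention.query.weight", "W_q"),
    ("encoder.layer.0.attention.query.bias", "b_q"),
    ("encoder.layer.0.LayerNorm.weight", "ln_w"),
    ("encoder.layer.0.LayerNorm.bias", "ln_b"),
]

# Common convention: exclude biases and LayerNorm weights from weight decay,
# and opt in to decay for everything else.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in named_parameters
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in named_parameters
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]

# These groups would then be handed to the optimizer:
# optimizer = AdamW(optimizer_grouped_parameters, lr=5e-5)
```

Because the per-group `weight_decay` overrides the optimizer default, a default of 0.0 means parameters you never mention are simply not decayed.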

In general, the default weight decay for all optimizers is 0 (I don’t know why PyTorch set 0.01 for just AdamW; all other optimizers have a default of 0), because you have to opt in to weight decay. Even if it’s true that Adam and AdamW behave the same way when the weight decay is set to 0, I don’t think that’s enough reason to change the default behavior (0.01 is a great default otherwise; it’s the one we set in fastai for the Learner after countless experiments, but I think it should be set in a higher-level API, not the optimizer itself).

And like @BramVanroy said, it would be such a breaking change that even if we really wanted to change that default, we probably wouldn’t.