AdamW implementation

Hi,
I was looking at the :hugs: implementation of the AdamW optimizer and I didn’t understand why you put the weight decay at the end.

Shouldn’t this line:
p.data.addcdiv_(exp_avg, denom, value=-step_size)

be swapped with the weight decay part?
Thanks.

The AdamW algorithm from the “Decoupled Weight Decay Regularization” paper & the relevant source code for transformers.AdamW:
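For reference, here is roughly what the relevant part of the optimizer step looks like, as a paraphrased sketch rather than a verbatim copy of the source (only the addcdiv_ line and the weight-decay comment are quoted from the actual code):

```python
import math
import torch

# Sketch of one AdamW step for a single parameter, in the order of operations
# discussed here: Adam update first, weight decay second.
# Paraphrased, NOT the verbatim transformers source.
def adamw_step(p, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
               betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01):
    beta1, beta2 = betas

    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)               # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)  # second moment
    denom = exp_avg_sq.sqrt().add_(eps)

    # bias-corrected step size
    step_size = lr * math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)

    # 1) Adam update
    p.data.addcdiv_(exp_avg, denom, value=-step_size)

    # 2) "Add weight decay at the end (fixed version)"
    if weight_decay > 0.0:
        p.data.add_(p.data, alpha=-lr * weight_decay)


# dummy usage
p = torch.nn.Parameter(torch.ones(3))
adamw_step(p, torch.full((3,), 0.5), torch.zeros(3), torch.zeros(3), step=1)
```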

The two lines each subtract an independent quantity from the model parameters, so executing them in either order will give the same results.

Thanks for the fast reply :slight_smile: .

I’m not sure why they are independent.
The weight decay line adds the parameter multiplied by -lr*wd to the parameter itself, so it depends on the parameter’s data.

The first line adds a value to the parameter’s data, i.e. it changes it.
So the first line affects the value that will be added in the second line, which makes them dependent.

In the following image, x denotes the parameter’s data, a denotes the value added in line 1, and b stands for lr*wd:

[image: the two versions of the update written out with x, a, and b]
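Writing both versions out (this is roughly what the image above shows, using the same x, a and b):

$$
\begin{aligned}
\text{weight decay first (proposed):}\quad & x' = (x - b\,x) + a = (1-b)\,x + a \\
\text{weight decay second (current code):}\quad & x' = (x + a) - b\,(x + a) = (1-b)\,x + (1-b)\,a
\end{aligned}
$$

The two results differ by the term $b\,a$.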

So the updated x is different in the two versions.
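A quick numerical check with made-up numbers (x = 1, a = 0.1, b = 0.01) shows the same thing:

```python
import torch

x = torch.tensor([1.0])   # parameter
a = torch.tensor([0.1])   # value added by the Adam update in line 1
b = 0.01                  # lr * wd

# weight decay second (current code): decay sees the already-updated parameter
x1 = x.clone()
x1 += a
x1 -= b * x1

# weight decay first (proposed): decay sees the original parameter
x2 = x.clone()
x2 -= b * x2
x2 += a

print(x1.item(), x2.item())  # ~1.089 vs ~1.090 -> not the same
```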

Thanks again.

Ah yes, you’re right, it is indeed in the wrong place and should come first. Thanks for the detailed explanation; I was a bit too tired when I first read your post :slight_smile:

However, there is a line # Add weight decay at the end (fixed version) in the code introduced by @thomwolf, so there is probably a reason it is applied second.


Oh that’s totally fine :upside_down_face: thanks for your replies!
I’ll wait for @thomwolf’s response.