AdamW implementation

Hi,
I was looking at the :hugs: implementation of the AdamW optimizer and I didn’t understand why you put the weight decay at the end.

Shouldn’t this line:
p.data.addcdiv_(exp_avg, denom, value=-step_size)

be swapped with the weight decay part?
Thanks.

The AdamW algorithm from the “Decoupled Weight Decay Regularization” paper & the relevant source code for transformers.AdamW:
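For reference, here is roughly what the relevant part of the optimizer step looks like, as a paraphrased sketch rather than a verbatim copy of the source (only the addcdiv_ line and the weight-decay comment are quoted from the actual code):

```python
import math
import torch

# Sketch of one AdamW step for a single parameter, in the order of operations
# discussed here: Adam update first, weight decay second.
# Paraphrased, NOT the verbatim transformers source.
def adamw_step(p, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
               betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01):
    beta1, beta2 = betas

    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)               # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)  # second moment
    denom = exp_avg_sq.sqrt().add_(eps)

    # bias-corrected step size
    step_size = lr * math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)

    # 1) Adam update
    p.data.addcdiv_(exp_avg, denom, value=-step_size)

    # 2) "Add weight decay at the end (fixed version)"
    if weight_decay > 0.0:
        p.data.add_(p.data, alpha=-lr * weight_decay)


# dummy usage
p = torch.nn.Parameter(torch.ones(3))
adamw_step(p, torch.full((3,), 0.5), torch.zeros(3), torch.zeros(3), step=1)
```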

The two lines each subtract an independent quantity from the model parameters, so executing them in either order will give the same results.

Thanks for the fast reply :slight_smile: .

I’m not sure why they are independent.
The weight decay line adds the parameter multiplied by -lr*wd to the parameter itself, so it depends on the parameter’s data.

The first line adds a value to the parameter’s data, i.e. it changes it.
So the first line affects the value that will be added in the second line, which makes them dependent.

In the following image, x denotes the parameter’s data, a denotes the value added in line 1, and b stands for lr*wd:

[image: the two versions of the update written out with x, a, and b]
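Writing both versions out (this is roughly what the image above shows, using the same x, a and b):

$$
\begin{aligned}
\text{weight decay first (proposed):}\quad & x' = (x - b\,x) + a = (1-b)\,x + a \\
\text{weight decay second (current code):}\quad & x' = (x + a) - b\,(x + a) = (1-b)\,x + (1-b)\,a
\end{aligned}
$$

The two results differ by the term $b\,a$.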

So the updated x is different in the two versions.
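A quick numerical check with made-up numbers (x = 1, a = 0.1, b = 0.01) shows the same thing:

```python
import torch

x = torch.tensor([1.0])   # parameter
a = torch.tensor([0.1])   # value added by the Adam update in line 1
b = 0.01                  # lr * wd

# weight decay second (current code): decay sees the already-updated parameter
x1 = x.clone()
x1 += a
x1 -= b * x1

# weight decay first (proposed): decay sees the original parameter
x2 = x.clone()
x2 -= b * x2
x2 += a

print(x1.item(), x2.item())  # ~1.089 vs ~1.090 -> not the same
```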

Thanks again.

Ah yes, you’re right, it is indeed in the wrong place and should come first. Thanks for the detailed explanation; I was a bit too tired when I first read your post :slight_smile:

However, there is a line # Add weight decay at the end (fixed version) in the code introduced by @thomwolf, so there is probably a reason it is applied second.


Oh that’s totally fine :upside_down_face: thanks for your replies!
I’ll wait for @thomwolf’s response.