I’m not sure why they are independent.
The weight decay line is adding the parameter multiplied by -lr*wd to the parameter itself, so it depends on the parameter’s data.
The first line is adding a value to the parameter’s data i.e. changing it.
So the first line changes the value that the second line adds, which makes them dependent.
In the following image, x denotes the parameter’s data, a denotes the value added in line 1, and b replaces lr*wd.
So the updated x is different in the two versions.
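
To make the difference concrete, here is a minimal sketch in plain Python floats. It is not the actual optimizer code: the function names and the numbers are made up for illustration, and x, a and b have the same meaning as above.

```python
import math

def update_then_decay(x, a, b):
    # Line 1 first: add the update a to the parameter.
    x = x + a
    # Then weight decay, which now sees the already-updated x.
    x = x - b * x
    return x  # algebraically (x + a) * (1 - b)

def decay_then_update(x, a, b):
    # Weight decay first, computed from the original x.
    x = x - b * x
    # Then add the update a.
    x = x + a
    return x  # algebraically x * (1 - b) + a

# Illustrative values only (hypothetical, not taken from any real run).
x, a, b = 1.0, 0.5, 0.1
v1 = update_then_decay(x, a, b)
v2 = decay_then_update(x, a, b)

# The two orderings differ by -a * b, up to floating point rounding.
print(v1, v2, math.isclose(v1 - v2, -a * b))
```

Expanding both expressions gives (x + a)(1 - b) = x + a - bx - ab for the first ordering and x(1 - b) + a = x + a - bx for the second, so the results differ by exactly the ab term. The gap is small when lr*wd is small, but it is not zero.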
Ah yes, you’re right: it is indeed in the wrong place and should be put first. Thanks for the lengthy explanation, I was a bit too tired when I first read your post.
However, the code introduced by @thomwolf contains the comment "# Add weight decay at the end (fixed version)", so there is probably a reason it was placed second.