When creating a default optimizer, the Trainer class builds two parameter groups, one with weight decay applied and one without, and it assigns parameters to them based on their names (whether the name contains “LayerNorm” or “bias”).
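For reference, the grouping logic looks roughly like this. This is a minimal sketch of what older versions of Trainer do when no optimizer is passed in; the exact code may differ across transformers releases:

```python
import torch

def grouped_parameters(model: torch.nn.Module, weight_decay: float):
    # Split parameters purely by name: anything containing "bias" or
    # "LayerNorm.weight" is excluded from weight decay, everything else is decayed.
    no_decay = ["bias", "LayerNorm.weight"]
    return [
        {
            "params": [p for n, p in model.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            "weight_decay": weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
```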
Problem is, not all models’ parameters are named the same way; GPT2’s layer normalization layers, for example, are named ln_ followed by a number or an f, hence weight decay will be applied to the weights of the LayerNorm layers in GPT2.
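A quick way to see this (assuming the stock gpt2 checkpoint; the printed names are illustrative):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT2's layer norms are called ln_1 / ln_2 / ln_f, not "LayerNorm",
# so their .weight parameters do not match the name-based no_decay filter above.
ln_names = [n for n, _ in model.named_parameters() if ".ln_" in n]
print(ln_names[:4])
# e.g. ['transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias',
#       'transformer.h.0.ln_2.weight', 'transformer.h.0.ln_2.bias']
```

Note that the ln_*.bias parameters still escape weight decay because they match “bias”; it is only the layer norm weights that end up in the decayed group.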
I don’t know if this is an “issue” per se, but it’s definitely something to be cautious about when training/fine-tuning GPT2.