Parameter groups and GPT2 LayerNorm

When creating the default optimizer, the Trainer class splits the parameters into two groups, one with weight decay and one without, and it decides based on the parameter name: whether or not the name contains "LayerNorm" or "bias".
The problem is that not all models name their parameters the same way. GPT2's layer normalization layers, for example, are named `ln_` followed by a number or an `f` (`ln_1`, `ln_2`, `ln_f`), so the check misses them and weight decay ends up being applied to the weights of GPT2's LayerNorm layers.
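To make this concrete, here is a small sketch of the name-based split. The `TinyGPT2Block` module and the `no_decay` list are illustrative stand-ins (not the actual Trainer code), but the names mirror GPT2's real `ln_1`/`ln_f` naming:

```python
from torch import nn

# Illustrative stand-in for a GPT2 block: the layer norms are called
# "ln_1" / "ln_f", so their parameter names never contain "LayerNorm".
class TinyGPT2Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln_1 = nn.LayerNorm(8)
        self.attn = nn.Linear(8, 8)
        self.ln_f = nn.LayerNorm(8)

model = TinyGPT2Block()

# Name-based split as described above (simplified): any parameter whose
# name contains one of these substrings is exempt from weight decay.
no_decay = ["bias", "LayerNorm.weight"]
decay_params = [n for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)]

# The LayerNorm weights slip into the decay group because their names
# are "ln_1.weight" / "ln_f.weight", not "LayerNorm.weight".
print(decay_params)  # → ['ln_1.weight', 'attn.weight', 'ln_f.weight']
```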

I don’t know if this is an “issue” per se, but it’s definitely something to be cautious about when training/fine-tuning GPT2.

Oh indeed. A check on names only does not sound super smart. Will try to write something that checks the class of the modules instead.
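A class-based check could look something like this sketch. The helper and module names here are illustrative assumptions, not the code from the fix: it walks the module tree and skips every parameter owned by a module whose class is excluded, so the layer names no longer matter:

```python
from torch import nn

# Illustrative stand-in for a GPT2 block (ln_* layer norms, as above).
class TinyGPT2Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln_1 = nn.LayerNorm(8)
        self.attn = nn.Linear(8, 8)
        self.ln_f = nn.LayerNorm(8)

def get_parameter_names(model, forbidden_layer_types):
    """Recursively collect parameter names, skipping every parameter
    that lives inside a module of a forbidden class."""
    result = []
    for name, child in model.named_children():
        if isinstance(child, tuple(forbidden_layer_types)):
            continue  # class-based exclusion: no name matching involved
        result += [f"{name}.{n}"
                   for n in get_parameter_names(child, forbidden_layer_types)]
    # parameters defined directly on this module, not on a child
    result += list(model._parameters.keys())
    return result

model = TinyGPT2Block()
# Everything outside a LayerNorm that is not a bias gets weight decay.
decay_parameters = [n for n in get_parameter_names(model, [nn.LayerNorm])
                    if "bias" not in n]
print(decay_parameters)  # → ['attn.weight']; the ln_* weights are excluded
```

With this approach, a model can call its normalization layers anything it likes and they still end up in the no-decay group.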


Done in #10598.


That was very quick :zap: