Parameter groups and GPT2 LayerNorm

When creating the default optimizer, the Trainer class builds two parameter groups depending on whether weight decay should be applied, and it decides that based on the parameter name (whether or not it contains “LayerNorm” or “bias”).
The problem is that not all models name their parameters the same way: GPT2’s layer normalization layers, for example, are named ln_ followed by a number or an f, so weight decay ends up being applied to the weights of GPT2’s LayerNorm layers.
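
As an illustration, here is a quick sketch (not the Trainer’s actual code) that reproduces the name-based split on a small randomly initialized GPT2 and prints the LayerNorm weights that slip into the weight-decay group:

```python
# A quick sketch (not the Trainer's actual code) of the name-based split,
# using a small randomly initialized GPT2 so nothing needs to be downloaded.
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(n_layer=2))

no_decay = ["bias", "LayerNorm.weight"]
decay_param_names = [
    name
    for name, _ in model.named_parameters()
    if not any(nd in name for nd in no_decay)
]

# GPT2's layer norms are named ln_1 / ln_2 / ln_f, so their weights don't
# match "LayerNorm.weight" and end up in the weight-decay group:
print([name for name in decay_param_names if "ln_" in name])
# ['transformer.h.0.ln_1.weight', 'transformer.h.0.ln_2.weight',
#  'transformer.h.1.ln_1.weight', 'transformer.h.1.ln_2.weight',
#  'transformer.ln_f.weight']
```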

I don’t know if this is an “issue” per se, but it’s definitely something to be cautious about when training/fine-tuning GPT2.

Oh indeed. A check on names only does not sound super smart. Will try to write something that checks the class of the modules instead.
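
In case it helps, a rough sketch of what a class-based check could look like (the names below are mine, not necessarily the exact code that was merged):

```python
# A rough sketch of a class-based check: gather the parameters that belong to
# nn.LayerNorm modules and exclude them, together with biases, from weight decay.
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(n_layer=2))

# Full names of every parameter owned by an nn.LayerNorm module.
layernorm_param_names = {
    f"{module_name}.{param_name}"
    for module_name, module in model.named_modules()
    if isinstance(module, nn.LayerNorm)
    for param_name, _ in module.named_parameters()
}

optimizer_grouped_parameters = [
    {
        "params": [
            param for name, param in model.named_parameters()
            if name not in layernorm_param_names and not name.endswith("bias")
        ],
        "weight_decay": 0.01,
    },
    {
        "params": [
            param for name, param in model.named_parameters()
            if name in layernorm_param_names or name.endswith("bias")
        ],
        "weight_decay": 0.0,
    },
]
```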

Done in #10598.

Thanks!
That was very quick ⚡