Parameter groups and GPT2 LayerNorm

salti · March 8, 2021, 1:36pm

When creating a default optimizer, the Trainer class builds two parameter groups depending on whether weight decay should be applied. It decides based on the parameter name: does it contain “LayerNorm” or “bias”?
The problem is that not all models name their parameters the same way. GPT2’s layer normalization layers, for example, are named ln_ followed by a number or an f, so weight decay will be applied to the weights of the LayerNorm layers in GPT2.

I don’t know if this is an “issue” per se, but it’s definitely something to be cautious about when training/fine-tuning GPT2.
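To make the mismatch concrete, here is a minimal sketch of a name-based split. It is not the Trainer's actual code: the helper `split_by_name` and the shortened parameter names are illustrative assumptions, and only the two substrings come from the description above.

```python
# Substrings that exclude a parameter from weight decay (as described above).
NO_DECAY_MARKERS = ["bias", "LayerNorm"]


def split_by_name(param_names):
    """Hypothetical helper: split names into (decay, no_decay) by substring."""
    decay, no_decay = [], []
    for name in param_names:
        if any(marker in name for marker in NO_DECAY_MARKERS):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay


# Illustrative, shortened parameter names in the style of each model.
bert_like = [
    "encoder.layer.0.attention.output.dense.weight",
    "encoder.layer.0.attention.output.dense.bias",
    "encoder.layer.0.attention.output.LayerNorm.weight",
    "encoder.layer.0.attention.output.LayerNorm.bias",
]
gpt2_like = [
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_attn.bias",
    "h.0.ln_1.weight",
    "h.0.ln_1.bias",
    "ln_f.weight",
    "ln_f.bias",
]

bert_decay, _ = split_by_name(bert_like)
gpt2_decay, _ = split_by_name(gpt2_like)

print(bert_decay)  # only the dense weight is decayed
print(gpt2_decay)  # the ln_1 and ln_f weights slip into the decay group
```

With the BERT-style names only the dense weight ends up in the decay group, while with the GPT2-style names `h.0.ln_1.weight` and `ln_f.weight` are decayed as well, because neither contains “LayerNorm”.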

sgugger · March 8, 2021, 2:17pm

Oh indeed. A check on names only does not sound super smart. Will try to write something that checks the class of the modules instead.
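A rough sketch of what a class-based check could look like. To keep it runnable without any dependencies it uses tiny stand-in classes (`Module`, `Linear`, `LayerNorm`) rather than `torch.nn`, and the helper `decay_param_names` is hypothetical. In PyTorch the analogous test would be `isinstance(child, torch.nn.LayerNorm)` while walking `named_children()`. Biases are still filtered by name here, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Stand-in for a module: its own parameter names plus named children."""
    params: list = field(default_factory=list)
    children: dict = field(default_factory=dict)


class Linear(Module):
    pass


class LayerNorm(Module):
    pass


def decay_param_names(module, skip_types, prefix=""):
    """Hypothetical helper: collect parameter names that should get weight
    decay, skipping every module whose class is in skip_types."""
    if isinstance(module, skip_types):
        return []
    # Biases are still excluded by name (an assumption of this sketch).
    names = [prefix + p for p in module.params if p != "bias"]
    for child_name, child in module.children.items():
        names += decay_param_names(child, skip_types, prefix + child_name + ".")
    return names


# A GPT2-flavoured toy model: layer norms are called ln_1 and ln_f.
block = Module(children={
    "ln_1": LayerNorm(params=["weight", "bias"]),
    "attn": Module(children={"c_attn": Linear(params=["weight", "bias"])}),
})
model = Module(children={"h0": block, "ln_f": LayerNorm(params=["weight", "bias"])})

decay = decay_param_names(model, (LayerNorm,))
print(decay)
```

Because the decision depends on the module's class, the `ln_1` and `ln_f` weights are excluded regardless of what the attributes are called.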

sgugger · March 8, 2021, 9:35pm

Done in #10598.

salti · March 9, 2021, 1:14pm

Thanks!
That was very quick
