AdamW PyTorch vs Hugging Face

I noticed that the default weight_decay differs between PyTorch's AdamW implementation and Hugging Face's (0.0 in transformers, 1e-2 in PyTorch). Also, when fine-tuning BERT we usually set the weight decay directly on the parameter groups, which makes me think the correct value to pass to the optimizer should be 0, otherwise you would be applying weight decay twice (?).
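
A quick sketch to confirm the defaults I mean (it assumes the deprecated transformers.AdamW is still importable in your installed version):

import inspect

import torch
from transformers import AdamW as HFAdamW  # deprecated; may be removed in newer releases

# Default of the weight_decay argument in each constructor
print(inspect.signature(torch.optim.AdamW).parameters["weight_decay"].default)  # 0.01
print(inspect.signature(HFAdamW).parameters["weight_decay"].default)            # 0.0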

Example:

no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]
optimizer_grouped_parameters = [
    {
        # Parameters that do get weight decay (everything except biases and LayerNorm)
        "params": [
            p
            for n, p in model.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": self.hparams.weight_decay,
    },
    {
        # Parameters excluded from weight decay
        "params": [
            p
            for n, p in model.named_parameters()
            if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=self.hparams.learning_rate,
    eps=self.hparams.adam_epsilon,
    weight_decay=?,  # <-- what should go here?
)
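
As far as I can tell, the constructor's weight_decay only acts as a fallback for groups that don't set their own value, so with the grouping above it would never touch either group. Here is a minimal check of that (Toy and the 0.01 value are just placeholders for my real model and hparams):

import torch
import torch.nn as nn

# Tiny stand-in for the real model, just to get BERT-like parameter names
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.LayerNorm = nn.LayerNorm(4)

model = Toy()
no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]
grouped = [
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]

# Both groups set weight_decay explicitly, so the constructor default is never applied
optimizer = torch.optim.AdamW(grouped, lr=1e-3)
print([g["weight_decay"] for g in optimizer.param_groups])  # [0.01, 0.0]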

I think transformers already encourages you to use PyTorch's implementation via a deprecation warning, which makes it even more confusing.
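
A minimal way to see that warning (again assuming the deprecated class is still present in your installed version):

import warnings

import torch.nn as nn
from transformers import AdamW as HFAdamW  # deprecated

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    HFAdamW(nn.Linear(2, 2).parameters(), lr=1e-3)

print([str(w.message) for w in caught])
# should show a FutureWarning recommending torch.optim.AdamW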

AdamW PyTorch: AdamW — PyTorch 1.13 documentation
AdamW transformers: Optimization