AdamW PyTorch vs Hugging Face

I noticed that the default weight_decay differs between PyTorch's AdamW implementation and Hugging Face's (0.0 in transformers, 1e-2 in PyTorch). Also, when fine-tuning BERT we usually set the weight decay directly on the parameter groups, which makes me think the correct value to pass to the optimizer should be 0, otherwise you would be applying weight decay twice (?).
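
A quick sketch to confirm the defaults I mean (it assumes the deprecated transformers.AdamW is still importable in your installed version):

import inspect

import torch
from transformers import AdamW as HFAdamW  # deprecated; may be removed in newer releases

# Default of the weight_decay argument in each constructor
print(inspect.signature(torch.optim.AdamW).parameters["weight_decay"].default)  # 0.01
print(inspect.signature(HFAdamW).parameters["weight_decay"].default)            # 0.0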

Example:

no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]
optimizer_grouped_parameters = [
    {
        # Parameters that do get weight decay (everything except biases and LayerNorm)
        "params": [
            p
            for n, p in model.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": self.hparams.weight_decay,
    },
    {
        # Parameters excluded from weight decay
        "params": [
            p
            for n, p in model.named_parameters()
            if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=self.hparams.learning_rate,
    eps=self.hparams.adam_epsilon,
    weight_decay=?,  # <-- what should go here?
)
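
As far as I can tell, the constructor's weight_decay only acts as a fallback for groups that don't set their own value, so with the grouping above it would never touch either group. Here is a minimal check of that (Toy and the 0.01 value are just placeholders for my real model and hparams):

import torch
import torch.nn as nn

# Tiny stand-in for the real model, just to get BERT-like parameter names
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.LayerNorm = nn.LayerNorm(4)

model = Toy()
no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]
grouped = [
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]

# Both groups set weight_decay explicitly, so the constructor default is never applied
optimizer = torch.optim.AdamW(grouped, lr=1e-3)
print([g["weight_decay"] for g in optimizer.param_groups])  # [0.01, 0.0]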

I think transformers already encourages you to use PyTorch's implementation via a deprecation warning, which makes it even more confusing.
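
A minimal way to see that warning (again assuming the deprecated class is still present in your installed version):

import warnings

import torch.nn as nn
from transformers import AdamW as HFAdamW  # deprecated

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    HFAdamW(nn.Linear(2, 2).parameters(), lr=1e-3)

print([str(w.message) for w in caught])
# should show a FutureWarning recommending torch.optim.AdamW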

AdamW PyTorch: AdamW — PyTorch 1.13 documentation
AdamW transformers: Optimization