Unable to train a good model after using exclude_from_weight_decay

xuan4470 · October 19, 2021, 3:57am

I am trying to train a BERT-BASE-UNCASED model to use as a baseline in my experiment. In the paper "BERTs of a feather do not generalize together", the BERT-BASE-UNCASED models always reach an accuracy above 84%. However, no matter how I train my BERT model with the AdamW optimizer, my accuracy stays between 83% and 83.5%. I read their code and found that they use a different optimizer (TensorFlow), shown below.

optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

I tried to reproduce the same optimizer in PyTorch, but after doing so my accuracy dropped to 47%. I do not know what I did wrong.

import torch

def add_weight_decay(net, l2_value):
    # Split trainable parameters into a group with weight decay and a group without.
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights
        if 'LayerNorm' in name or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0.},
            {'params': decay, 'weight_decay': l2_value}]

params = add_weight_decay(model, 0.01)
optimizer = torch.optim.Adam(params, lr=2e-5, betas=(0.9, 0.999), eps=1e-6)
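One difference worth checking: `torch.optim.Adam` treats `weight_decay` as an L2 term added to the gradient, while `torch.optim.AdamW` applies decoupled weight decay. Whether this explains the drop to 47% is not confirmed. Below is a minimal sketch of the same parameter grouping passed to `AdamW`. `TinyBlock` and `grouped_parameters` are hypothetical names used only for illustration, and the tiny module stands in for the real BERT model so that the snippet runs on its own.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for one BERT layer; only the parameter names matter."""

    def __init__(self, hidden=8):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.LayerNorm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.LayerNorm(self.dense(x))


def grouped_parameters(net, weight_decay, skip=("LayerNorm", "layer_norm", "bias")):
    """Return two parameter groups: decayed weights and excluded parameters."""
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights are not passed to the optimizer
        if any(key in name for key in skip):
            no_decay.append(param)  # LayerNorm parameters and biases
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


model = TinyBlock()  # assumption: replace with the actual BERT model
groups = grouped_parameters(model, 0.01)
# AdamW applies the decay directly to the weights instead of adding it to the gradient.
optimizer = torch.optim.AdamW(groups, lr=2e-5, betas=(0.9, 0.999), eps=1e-6)
```

The `skip` substrings mirror the `exclude_from_weight_decay` list from the TensorFlow snippet; the learning rate, betas, and epsilon are copied from the code above.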

Also, I would like to know why we need to exclude the LayerNorm layers and the biases from weight decay. If I do not exclude them, my accuracy is around 83.5%.
