Unable to train a good model after using exclude_from_weight_decay

I am trying to train a BERT-BASE-UNCASED model to use as a baseline in my experiments. In the paper "BERTs of a feather do not generalize together", the BERT-BASE-UNCASED models always reach an accuracy above 84%. However, no matter how I train my BERT model with the AdamW optimizer, my accuracy stays between 83% and 83.5%. I read their code and found that they configure their TensorFlow optimizer differently, as shown below.

optimizer = AdamWeightDecayOptimizer(
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
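My understanding, which is an assumption I have not checked against every version of their code, is that each entry in exclude_from_weight_decay is matched against the variable name with a regex search, so a pattern excludes a variable if it occurs anywhere in the name. A minimal sketch of that matching in plain Python; the helper uses_weight_decay and the parameter names are hypothetical and only for illustration:

```python
import re

# Patterns copied from the exclude_from_weight_decay argument above.
EXCLUDE_PATTERNS = ["LayerNorm", "layer_norm", "bias"]


def uses_weight_decay(param_name, exclude_patterns=EXCLUDE_PATTERNS):
    """Return False if any pattern occurs anywhere in the parameter name."""
    return not any(re.search(p, param_name) for p in exclude_patterns)


# Hypothetical parameter names, for illustration only.
names = [
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
    "encoder.layer.0.output.LayerNorm.weight",
    "embeddings.word_embeddings.weight",
]

decayed = [n for n in names if uses_weight_decay(n)]
excluded = [n for n in names if not uses_weight_decay(n)]
```

Under this reading, decayed holds the query and embedding weights, and excluded holds the bias and the LayerNorm weight.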

I tried to reproduce the same optimizer in PyTorch, but after doing so my accuracy dropped to 47%. I do not know what I did wrong.

import torch

def add_weight_decay(net, l2_value):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad: continue  # frozen weights
        if 'LayerNorm' in name or name.endswith(".bias"):
            no_decay.append(param)  # excluded from weight decay
        else:
            decay.append(param)  # regular weights, decayed
    return [{'params': no_decay, 'weight_decay': 0.}, {'params': decay, 'weight_decay': l2_value}]
params = add_weight_decay(model, 0.01)  # model is my BERT-BASE-UNCASED model
optimizer = torch.optim.Adam(params, lr=2e-5, betas=(0.9, 0.999), eps=1e-6)

Also, I would like to know why we need to exclude the LayerNorm layers and the biases from weight decay. If I do not exclude them, my accuracy is around 83.5%.