BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.
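In code terms, the optimizer side of that paragraph seems straightforward; a minimal sketch (the placeholder model is mine, the hyperparameter values come from the quote, and I’m using AdamW as discussed below):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model, just for illustration

# Optimizer hyperparameters from the quoted paragraph. The schedule part
# (warm up over 10,000 steps to a peak of 1e-4, then linear decay) is what
# the rest of this post is about.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,             # peak learning rate
    betas=(0.9, 0.999),  # β1, β2
    eps=1e-6,            # ε
    weight_decay=0.01,   # L2 weight decay
)
```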
I’m trying to figure out how to replicate this optimizer schedule. I see that in the trainer.py code there’s AdamW and get_linear_schedule_with_warmup, but I’m not sure how to replicate RoBERTa’s learning rate schedule from these classes.
It seems that AdamW already has its own weight decay, so using AdamW with get_linear_schedule_with_warmup would result in two kinds of decay. So to me it makes more sense to use AdamW with get_constant_schedule_with_warmup.
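For reference, a minimal sketch of that alternative (placeholder model; the warmup step count is taken from the quote, not from any existing config):

```python
import torch
from transformers import get_constant_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # placeholder model, just for illustration

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# The LR warms up to 1e-4 over 10,000 steps and then stays constant; the only
# remaining "decay" would be AdamW's weight_decay regularization.
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=10_000)
```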
I am also wondering how to set up a schedule that 1) starts at a given learning rate, 2) warms it up to a particular maximum value, and 3) decays from that maximum using a particular decay rate.
The schedulers in the main optimization module seem to be based on warming up from zero and decaying back to zero.
It seems that the learning rate starts at zero, increases to the value specified in the optimizer, and then linearly decreases to zero. get_linear_schedule_with_warmup could work, but it would need to be altered to start from (and decay toward) a different learning rate.
It seems that get_linear_schedule_with_warmup uses torch.optim.lr_scheduler.LambdaLR under the hood.
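Looking at the source, it seems to boil down to roughly this (my paraphrase, not a verbatim copy of the transformers implementation):

```python
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_then_linear_decay(optimizer, num_warmup_steps, num_training_steps):
    # The lambda returns a multiplier applied to the LR set on the optimizer.
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # 0 -> 1 over the warmup phase
            return current_step / max(1, num_warmup_steps)
        # 1 -> 0 linearly over the remaining training steps
        return max(
            0.0,
            (num_training_steps - current_step)
            / max(1, num_training_steps - num_warmup_steps),
        )
    return LambdaLR(optimizer, lr_lambda)
```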
So I’m thinking of writing a custom lr_lambda function and passing it to LambdaLR directly.
I am attempting to make a custom scheduler that replicates the RoBERTa warmup; so far I came up with something along the lines of the sketch below, based on Hugging Face’s linear warmup scheduler.
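Something like this (not my exact code; init_lr, peak_lr, decay_rate, and num_warmup_steps are names I’m using for illustration):

```python
from torch.optim.lr_scheduler import LambdaLR

def get_warmup_decay_schedule(optimizer, init_lr, peak_lr, decay_rate, num_warmup_steps):
    # Assumes the optimizer was created with lr=peak_lr, since the lambda
    # returns a multiplier of that base learning rate.
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # linear warmup: init_lr -> peak_lr
            progress = current_step / max(1, num_warmup_steps)
            return (init_lr + progress * (peak_lr - init_lr)) / peak_lr
        # after warmup: multiplicative decay from peak_lr, decay_rate per step
        return decay_rate ** (current_step - num_warmup_steps)
    return LambdaLR(optimizer, lr_lambda)
```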
This is not quite exact, since the optimizer is AdamW, which has its own weight decay. For it to be exact, I would need init_lr to be replaced by the optimizer’s current learning rate, but according to the LambdaLR documentation, lr_lambda only takes an int (the current step).
So I need a way to adjust the optimizer’s learning rate based on its current learning rate.