How to create the warmup and decay from the BERT/Roberta papers?

Roberta’s pretraining is described below

BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.
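For reference, those hyperparameters seem to map fairly directly onto torch's AdamW (just a sketch of the optimizer setup, not Roberta's actual training code; the toy Linear model stands in for the real one):

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(10, 10)  # stand-in for the real Roberta model

# Hyperparameters quoted from the Roberta paper above
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,             # the peak value reached after warmup
    betas=(0.9, 0.999),  # β1, β2
    eps=1e-6,            # ε
    weight_decay=0.01,   # decoupled L2 weight decay
)
# The schedule would then warm up over the first 10,000 steps and
# decay linearly over the remaining updates out of S = 1,000,000.
```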

I’m trying to figure out how to replicate this optimizer schedule. I see that the code has AdamW and get_linear_schedule_with_warmup,

but I’m not sure how to replicate Roberta’s learning rate schedule from these classes.

It seems that AdamW already applies a decay of its own (the weight decay), so using AdamW with get_linear_schedule_with_warmup will result in two types of decay. So to me it makes more sense to use AdamW with get_constant_schedule_with_warmup.

I am also wondering how to set up a schedule that 1) starts from a given learning rate, 2) warms it up to a particular maximum value, and 3) decays from that maximum at a particular rate.

The schedules in the main optimization module all seem to be based on warming up from zero and decaying back to zero.
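As I read the source, the multiplier that get_linear_schedule_with_warmup applies at each step is equivalent to something like this (my own re-derivation, not the library code itself):

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    # Rises linearly from 0 to 1 over the warmup steps...
    if step < num_warmup_steps:
        return float(step) / float(max(1, num_warmup_steps))
    # ...then falls linearly from 1 back to 0 at num_training_steps
    return max(0.0, float(num_training_steps - step)
               / float(max(1, num_training_steps - num_warmup_steps)))

print(linear_warmup_multiplier(0, 10, 100))    # 0.0 at the start
print(linear_warmup_multiplier(10, 10, 100))   # 1.0 at the peak
print(linear_warmup_multiplier(100, 10, 100))  # 0.0 at the end
```

So both ends of the schedule are pinned to a multiplier of zero, which is exactly the limitation I’m running into.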

From looking further into the code for Roberta (and also Bert),

it seems that the learning rate starts at whatever is specified in the optimizer, is increased to a particular peak LR, and is then linearly decreased to zero. It seems that get_linear_schedule_with_warmup could work, but it would need to be altered to start from a different learning rate.

It seems that it uses torch.optim.lr_scheduler.LambdaLR

So I’m thinking of creating a custom function that uses that class directly.
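If I understand LambdaLR correctly, the lambda returns a multiplier on the learning rate the optimizer was constructed with, not an absolute LR. A minimal sketch of that behavior (toy numbers, any optimizer works):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.1)  # base lr

# Effective lr at each step is base_lr * lr_lambda(step)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 0.5 ** step)

lrs = []
for _ in range(3):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()
print(lrs)  # [0.1, 0.05, 0.025] — each step halves the base lr
```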

I am attempting to make a custom scheduler which replicates the Roberta warmup. So far I have come up with the following, based on Huggingface’s linear warmup scheduler:

from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup_with_peak(optimizer, num_warmup_steps, num_training_steps, init_lr, peak_lr, last_epoch=-1):
    # LambdaLR multiplies the optimizer's lr (init_lr) by the value
    # returned below, so peak_lr/init_lr rescales the peak to peak_lr
    scale = peak_lr / init_lr

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            # warm up linearly from 0 to peak_lr
            return (float(current_step) / float(max(1, num_warmup_steps))) * scale
        # decay linearly from peak_lr to 0; applying `scale` here too
        # keeps the two phases continuous at num_warmup_steps
        return scale * max(
            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))
        )

    return LambdaLR(optimizer, lr_lambda, last_epoch)

This is not quite exact, since the optimizer is AdamW, which has its own weight decay. For it to be exact, I would need init_lr to be replaced by the optimizer’s current learning rate, but according to the documentation for LambdaLR, lr_lambda only takes in an int (the current step).

So I need a way to adjust the optimizer’s learning rate based on its current learning rate.
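One workaround I’m considering, assuming LambdaLR really does only multiply the initial lr: construct the optimizer with lr=peak_lr and fold the init_lr/peak_lr ratio into the lambda, so the lambda never needs to read the live learning rate. A sketch (get_schedule_with_floor is my own name, and the warmup here starts from init_lr rather than zero):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def get_schedule_with_floor(optimizer, num_warmup_steps, num_training_steps,
                            init_lr, peak_lr, last_epoch=-1):
    # Assumes the optimizer was constructed with lr=peak_lr; the lambda
    # then only needs the fixed ratio init_lr/peak_lr, not the live lr.
    start = init_lr / peak_lr  # multiplier at step 0

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            # interpolate the multiplier from `start` up to 1.0 (peak_lr)
            return start + (1.0 - start) * current_step / max(1, num_warmup_steps)
        # then decay linearly from 1.0 (peak_lr) down to 0
        return max(0.0, (num_training_steps - current_step)
                   / max(1, num_training_steps - num_warmup_steps))

    return LambdaLR(optimizer, lr_lambda, last_epoch)

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=1e-4)  # lr here is the *peak* value
sched = get_schedule_with_floor(opt, num_warmup_steps=10,
                                num_training_steps=100,
                                init_lr=1e-5, peak_lr=1e-4)
print(opt.param_groups[0]["lr"])  # starts at (approximately) init_lr
```

Whether this counts as replicating Roberta exactly, I’m still not sure, which is why I’m asking here.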