How to create the warmup and decay from the BERT/Roberta papers?

RoBERTa’s pretraining is described below:

BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.
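In other words (this is my reading of the quote, and the function name below is just mine for illustration), the learning rate is a piecewise-linear function of the update step:

def lr_at_step(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup from 0 to peak_lr over warmup_steps, then linear decay back to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))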

I’m trying to figure out how to replicate this optimizer schedule. I see that the trainer.py code uses AdamW and get_linear_schedule_with_warmup.

But I’m not sure how to replicate RoBERTa’s learning rate schedule with these classes.

It seems that AdamW already applies its own weight decay, so using AdamW together with get_linear_schedule_with_warmup would result in two types of decay. To me it makes more sense to use AdamW with get_constant_schedule_with_warmup.
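In other words, something along these lines (just a sketch; model here is a placeholder for the model being trained):

from transformers import AdamW, get_constant_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Warm the LR up linearly from 0 to 1e-4 over the first 10,000 steps,
# then hold it constant for the rest of training.
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=10_000)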

I am also wondering how to set up a schedule that 1) starts from a particular learning rate, 2) warms it up to a particular maximum value, and 3) decays from that maximum using a particular decay rate.

The schedulers in the main optimization module seem to be based on warming up from zero and decaying back down to zero.


I looked further into the code for RoBERTa (https://github.com/pytorch/fairseq/blob/dd52ed0f3896639b3c04aa67c44775f689faf1a5/fairseq/optim/lr_scheduler/polynomial_decay_schedule.py) and also into the BERT code (https://github.com/google-research/bert/blob/master/optimization.py#L36).

It seems that the learning rate starts at whatever is specified in the optimizer, is increased to a particular peak LR, and is then linearly decreased to zero. So get_linear_schedule_with_warmup could work, but it would need to be altered to use a different starting learning rate.
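For comparison, the unaltered version would look roughly like this (model is a placeholder); it warms the LR up from 0 to whatever lr the optimizer was given, here the 1e-4 peak from the paper, and then decays it linearly back to 0:

from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(
    model.parameters(),
    lr=1e-4,              # peak LR from the paper
    betas=(0.9, 0.999),   # β1, β2
    eps=1e-6,             # ε
    weight_decay=0.01,    # L2 weight decay
)

# Linear warmup from 0 to lr over the first 10,000 steps,
# then linear decay from lr down to 0 at step 1,000,000.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,
    num_training_steps=1_000_000,
)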

It seems that get_linear_schedule_with_warmup uses torch.optim.lr_scheduler.LambdaLR under the hood.

So I’m thinking of writing a custom lr_lambda function and passing it to LambdaLR directly.
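As a sanity check on how LambdaLR behaves, here is a toy example (not the RoBERTa schedule): the lambda receives the step count and returns a multiplicative factor, and LambdaLR sets each parameter group’s LR to its initial LR times that factor.

import torch
from torch.optim.lr_scheduler import LambdaLR

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)

# The factor ramps from 0 to 1 over 10 steps, so the LR ramps from 0 to 1e-4
# and then stays there.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, step / 10))

for step in range(12):
    optimizer.step()
    scheduler.step()
    print(step, optimizer.param_groups[0]["lr"])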

I am attempting to make a custom scheduler that replicates the RoBERTa warmup. So far I’ve come up with the following, based on Hugging Face’s linear warmup scheduler:

from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup_with_peak(optimizer, num_warmup_steps, num_training_steps, init_lr, peak_lr, last_epoch=-1):

    def lr_lambda(current_step: int):
        # Warmup: the factor ramps from 0 up to peak_lr / init_lr, so the actual LR
        # ramps from 0 up to peak_lr (LambdaLR multiplies the optimizer's initial LR).
        if current_step < num_warmup_steps:
            return (float(current_step) / float(max(1, num_warmup_steps))) * (peak_lr / init_lr)
        # Decay: the factor falls linearly from 1.0 to 0.0 over the remaining steps,
        # i.e. the LR decays from init_lr (not peak_lr) down to 0.
        return max(
            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))
        )

    return LambdaLR(optimizer, lr_lambda, last_epoch)
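Calling it would look something like this (the numbers are placeholders, model stands in for the model being trained, and init_lr has to match the lr the optimizer was constructed with):

from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)  # lr here plays the role of init_lr

scheduler = get_linear_schedule_with_warmup_with_peak(
    optimizer,
    num_warmup_steps=10_000,
    num_training_steps=1_000_000,
    init_lr=1e-5,
    peak_lr=1e-4,
)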

The function above is not quite exact, though, since the optimizer is AdamW, which has its own weight decay. For it to be exact, I would need init_lr to be replaced by the optimizer’s current learning rate, but from the documentation on LambdaLR, lr_lambda seems to only take an int (the current step).

So I need a way to adjust the optimizer’s learning rate based on its current learning rate.
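For reference, the optimizer’s current learning rate(s) can at least be read and written directly through its param_groups, even though lr_lambda itself never sees them; a rough sketch:

# Note: LambdaLR multiplies the initial LR it recorded at construction time,
# not whatever value is currently stored in param_groups.
current_lrs = [group["lr"] for group in optimizer.param_groups]
print(current_lrs)

# Illustrative manual adjustment: halve every group's LR.
for group in optimizer.param_groups:
    group["lr"] *= 0.5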