How do I use lr_scheduler?

How do I use lr_scheduler in Trainer? It seems that whenever I pass the AdamW optimizer, it also needs a dictionary of params to tune. Since I am using the plain Trainer (not being intimate with PyTorch), the parameters are not exposed to pass to AdamW, which yields an error.

Does anyone have an idea of how I can do that?

Hi @Neel-Gupta, you’ll need to create a custom trainer by subclassing Trainer and overriding the create_optimizer_and_scheduler function (see here for the source code):

class MyAwesomeTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Add custom attributes here
            
    def create_optimizer_and_scheduler(self, num_training_steps):
        pass

Assuming that you’re trying to learn some custom parameters, the idea is to add a dict like

{"params": [p for n, p in self.model.named_parameters()  if "name_of_custom_params" in n and p.requires_grad], "lr": self.args.custom_params_lr}

to the optimizer_grouped_parameters list you can see in the source code. Then you can add the remaining bits with something like the following:

from transformers import AdamW, get_linear_schedule_with_warmup

def create_optimizer_and_scheduler(self, num_training_steps: int):
    no_decay = ["bias", "LayerNorm.weight"]
    # Add any new parameters to optimize for here as a new dict in the list of dicts
    optimizer_grouped_parameters = ...

    self.optimizer = AdamW(optimizer_grouped_parameters,
                           lr=self.args.learning_rate,
                           eps=self.args.adam_epsilon)
    self.lr_scheduler = get_linear_schedule_with_warmup(
        self.optimizer, num_warmup_steps=self.args.warmup_steps,
        num_training_steps=num_training_steps)
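
For concreteness, here is a hedged sketch of what that optimizer_grouped_parameters list could look like; the "name_of_custom_params" filter and self.args.custom_params_lr are placeholders for your own parameter names and learning rate, not built-in Trainer attributes:

optimizer_grouped_parameters = [
    # regular weights, with weight decay
    {"params": [p for n, p in self.model.named_parameters()
                if not any(nd in n for nd in no_decay) and "name_of_custom_params" not in n],
     "weight_decay": self.args.weight_decay},
    # biases and LayerNorm weights, without weight decay
    {"params": [p for n, p in self.model.named_parameters()
                if any(nd in n for nd in no_decay) and "name_of_custom_params" not in n],
     "weight_decay": 0.0},
    # the custom parameters, with their own learning rate
    {"params": [p for n, p in self.model.named_parameters()
                if "name_of_custom_params" in n and p.requires_grad],
     "lr": self.args.custom_params_lr},
]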

Does that make sense?


That seems pretty complicated :sweat_smile: I would probably work on this. Thanx a ton for your help!! :+1:


Haha, well at least you don’t have to implement all the other parts of the training loop :slight_smile:

What are you trying to do exactly with the lr scheduler?

I noticed that with the standard warmup_steps and weight_decay, after quite a few steps there seems to be some misconfiguration of the loss: after being stable and increasing slowly for quite a few epochs, it suddenly explodes.

I had this problem before when using native TensorFlow and fixed it by applying a scheduler (which also gave better accuracy faster) together with some custom callbacks in TF.

Ah in that case can’t you just configure warmup_steps and weight_decay directly in the TrainingArguments?

You can also change the scheduler type in case that’s what you’re after: transformers.trainer_utils — transformers 4.3.0 documentation

Finally, you can also implement custom callbacks in transformers - see here: Callbacks — transformers 4.3.0 documentation
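
For example, here is a minimal sketch (the scheduler choice, warmup, and decay values are just placeholders) that configures the scheduler directly through TrainingArguments and adds a simple logging callback:

from transformers import TrainingArguments, TrainerCallback

args = TrainingArguments(
    output_dir="out",
    lr_scheduler_type="cosine",  # or "linear", "polynomial", "constant_with_warmup", ...
    warmup_steps=500,            # placeholder
    weight_decay=0.01,           # placeholder
)

class PrintLossCallback(TrainerCallback):
    # called whenever the Trainer logs metrics
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            print(f"step {state.global_step}: {logs}")

# then pass callbacks=[PrintLossCallback()] when creating the Trainer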

@lewtun After customizing the Trainer class and optimizer, how should I pass them to Trainer or TrainingArguments?

args = TrainingArguments(
    num_train_epochs=2,
    weight_decay=0.01,
    lr_scheduler_type=...,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

Hi everyone!

I found @lewtun 's response super helpful but I think it’s slightly out-of-date (2021 was two years ago, which in computing is like 4 million years ago lol :wink:)

If you’re using TrainingArguments, you can use an lr_scheduler_type of "linear" or "cosine" without the need for a custom trainer, since they don’t require any additional arguments. However, if you want to use either of these with warmup, or maybe a polynomial or inverse square root scheduler, here’s what to do:

First, if you’re using AdamW and PyTorch, use from torch.optim import AdamW instead of :hugs:’s implementation as :hugs:’s is deprecated.

Let’s say you want to use the polynomial scheduler, for the sake of example. Then:

from torch.optim import AdamW
from transformers import Trainer, get_polynomial_decay_schedule_with_warmup

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def create_optimizer_and_scheduler(self, num_training_steps):
        self.optimizer = AdamW(self.model.parameters(),
                               lr=self.args.learning_rate,
                               weight_decay=self.args.weight_decay)
        # polynomial decay with no warmup; power=2 gives a quadratic decay
        self.lr_scheduler = get_polynomial_decay_schedule_with_warmup(
            self.optimizer, num_warmup_steps=0,
            num_training_steps=num_training_steps, power=2)

Now you can continue to use TrainingArguments the same as before. The value of lr_scheduler_type there doesn’t matter, since CustomTrainer will override it.

Now declare your trainer as:

trainer = CustomTrainer(...)

Again, same arguments as you had before but don’t specify optimizers.
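
Putting it together, a rough sketch (the model, datasets, and hyperparameter values below are just placeholders) could look like:

training_args = TrainingArguments(
    output_dir="out",
    learning_rate=5e-5,
    weight_decay=0.01,
    num_train_epochs=2,
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # no optimizers=... here: CustomTrainer builds the optimizer and scheduler itself
)
trainer.train()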

And you’re done! :smile:

Thanks,
Diane


Hi @lewtun, please, how do I use the PyTorch cyclic lr_scheduler here in the Hugging Face Trainer?
torch.optim.lr_scheduler.CyclicLR

@Owos did you figure it out?
@lewtun

Just seeing this. You define your scheduler and optimizer like this:

from torch.optim import AdamW
from transformers import get_polynomial_decay_schedule_with_warmup

optimizer = AdamW(...)
# note: get_polynomial_decay_schedule_with_warmup takes no num_cycles argument;
# for a cyclic schedule see get_cosine_with_hard_restarts_schedule_with_warmup
# (which does accept num_cycles) or CyclicLR below
lr_scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=training_args.warmup_steps,
    num_training_steps=num_training_steps,
)

trainer will be:

trainer = Trainer(
    ...,
    optimizers=(optimizer, lr_scheduler),
)
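
Since the original question was about torch.optim.lr_scheduler.CyclicLR, the same optimizers=(optimizer, lr_scheduler) pattern should work with it as well. A minimal sketch, where the model, datasets, learning-rate bounds, and step sizes are placeholders you would tune yourself:

import torch
from torch.optim import AdamW
from transformers import Trainer

optimizer = AdamW(model.parameters(), lr=1e-5)  # placeholder lr
lr_scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-6,          # placeholder lower bound
    max_lr=1e-4,           # placeholder upper bound
    step_size_up=500,      # steps to go from base_lr up to max_lr
    cycle_momentum=False,  # AdamW has no momentum param, so momentum cycling must be off
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, lr_scheduler),
)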

@jaideepcs

You can set it in the Trainer like this:

    # (snippet from inside a training function; assumed imports include:
    #   import os, torch
    #   from pathlib import Path
    #   from torch.optim import AdamW
    #   from transformers import Trainer, TrainingArguments, get_cosine_schedule_with_warmup)
    # -- max steps manually decided depending on how many tokens we want to train on
    per_device_train_batch_size = batch_size
    print(f'{per_device_train_batch_size=}')
    print(f'{num_epochs=} {max_steps=}')

    # -- Get Optimizer & Scheduler
    # - Get Optimizer
    if optim == 'paged_adamw_32bit':
        # paged 32-bit AdamW lives in bitsandbytes, not transformers
        from bitsandbytes.optim import PagedAdamW32bit
        optimizer = PagedAdamW32bit(model.parameters())
    elif optim == 'adamw_manual':
        optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    else:
        optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    print(f'{optimizer=}')
    # - Get Scheduler
    if lr_scheduler_type == 'cosine_with_warmup_manual':
        lr_scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(max_steps*warmup_ratio),
            num_training_steps=max_steps,
        )
    else:
        lr_scheduler = None
    print(f'{lr_scheduler=}')

    # -- Training arguments and trainer instantiation ref: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments
    output_dir = Path(f'~/data/results_{today}/').expanduser() if not debug else Path(f'~/data/results/').expanduser()
    # output_dir = '.'
    # print(f'{debug=} {output_dir=} \n {report_to=}')
    training_args = TrainingArguments(
        output_dir=output_dir,  # The output directory where the model predictions and checkpoints will be written.
        # output_dir='.',  # The output directory where the model predictions and checkpoints will be written.
        # num_train_epochs = num_train_epochs, 
        max_steps=max_steps,  # TODO: hard to fix, see above
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,  # based on alpaca https://github.com/tatsu-lab/stanford_alpaca, allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
        gradient_checkpointing = gradient_checkpointing,  # TODO depending on hardware set to true?
        # optim=optim,
        # warmup_steps=int(max_steps*warmup_ratio),  # TODO: once real training starts we can select this number for llama v2, what does llama v2 do to make it stable while v1 didn't?
        # warmup_ratio=warmup_ratio,  # copying alpaca for now, number of steps for a linear warmup, TODO once real training starts change? 
        # weight_decay=0.01,  # TODO once real training change?
        weight_decay=weight_decay,  # TODO once real training change?
        learning_rate = learning_rate,  # TODO once real training change? anything larger than -3 I've had terrible experiences with
        max_grad_norm=1.0, # TODO once real training change?
        # lr_scheduler_type=lr_scheduler_type,  # TODO once real training change? using what I've seen most in vision 
        # lr_scheduler_kwargs=lr_scheduler_kwargs,  # ref: https://huggingface.co/docs/transformers/v4.37.0/en/main_classes/optimizer_schedules#transformers.SchedulerType 
        logging_dir=Path('~/data/maf/logs').expanduser(),
        # save_steps=4000,  # alpaca does 2000, other defaults were 500
        save_steps=max_steps//3,  # alpaca does 2000, other defaults were 500
        # save_steps=1,  # alpaca does 2000, other defaults were 500
        # logging_steps=250,
        # logging_steps=50,  
        logging_first_step=True,
        # logging_steps=3,
        logging_steps=1,
        remove_unused_columns=False,  # TODO don't get why https://stackoverflow.com/questions/76879872/how-to-use-huggingface-hf-trainer-train-with-custom-collate-function/76929999#76929999 , https://claude.ai/chat/475a4638-cee3-4ce0-af64-c8b8d1dc0d90
        report_to=report_to,  # change to wandb!
        fp16=False,  # never ever set to True
        bf16=torch.cuda.get_device_capability(torch.cuda.current_device())[0] >= 8,  # if >= 8 ==> brain float 16 available or set to True if you always want fp32
    )
    print(f'{pretrained_model_name_or_path=}\n{optim=}\n{learning_rate=}')

    # TODO: might be nice to figure our how llamav2 counts the number of token's they've trained on
    print(f'{train_dataset=}')
    # print(f'{eval_dataset=}')
    trainer = Trainer(
        model=model,
        args=training_args,  
        train_dataset=train_dataset,
        optimizers=(optimizer, lr_scheduler),
    )

    # - Train
    cuda_visible_devices = os.environ.get('CUDA_VISIBLE_DEVICES')
    if cuda_visible_devices is not None:
        print(f"CUDA_VISIBLE_DEVICES = {cuda_visible_devices}")
    trainer.train()
    trainer.save_model(output_dir=output_dir)  # TODO is this really needed? https://discuss.huggingface.co/t/do-we-need-to-explicity-save-the-model-if-the-save-steps-is-not-a-multiple-of-the-num-steps-with-hf/56745

Related: Using Cosine LR scheduler via TrainingArguments in Trainer - #8 by brando