You can set it manually and pass it to the `Trainer` via its `optimizers=(optimizer, lr_scheduler)` argument, e.g.:
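Roughly the imports the snippet assumes (a sketch; variables like `batch_size`, `max_steps`, `optim`, `learning_rate`, `warmup_ratio`, `today`, `debug`, `report_to`, `model`, and `train_dataset` come from my config elsewhere):

import os
from pathlib import Path

import torch
from torch.optim import AdamW
from transformers import Trainer, TrainingArguments, get_cosine_schedule_with_warmup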
# -- max steps manually decided depending on how many tokens we want to train on
per_device_train_batch_size = batch_size
print(f'{per_device_train_batch_size=}')
print(f'{num_epochs=} {max_steps=}')
# -- Get Optimizer & Scheduler
# - Get Optimizer
if optim == 'paged_adamw_32bit':
    # note: the paged 32-bit AdamW comes from bitsandbytes, not transformers (assumes bitsandbytes is installed)
    from bitsandbytes.optim import PagedAdamW32bit
    optimizer = PagedAdamW32bit(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
elif optim == 'adamw_manual':
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
else:
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
print(f'{optimizer=}')
# - Get Scheduler
if lr_scheduler_type == 'cosine_with_warmup_manual':
    lr_scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(max_steps * warmup_ratio),
        num_training_steps=max_steps,
    )
else:
    lr_scheduler = None
print(f'{lr_scheduler=}')
# -- Training arguments and trainer instantiation ref: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments
output_dir = Path(f'~/data/results_{today}/').expanduser() if not debug else Path('~/data/results/').expanduser()
# output_dir = '.'
# print(f'{debug=} {output_dir=} \n {report_to=}')
training_args = TrainingArguments(
    output_dir=output_dir,  # The output directory where the model predictions and checkpoints will be written.
    # output_dir='.',
    # num_train_epochs=num_train_epochs,
    max_steps=max_steps,  # TODO: hard to fix, see above
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,  # based on alpaca https://github.com/tatsu-lab/stanford_alpaca ; effective_batch_size = gradient_accumulation_steps * batch_size, i.e. the number of iterations to accumulate before each optimizer update step
    gradient_checkpointing=gradient_checkpointing,  # TODO: depending on hardware, set to True?
    # optim=optim,
    # warmup_steps=int(max_steps*warmup_ratio),  # TODO: once real training starts we can select this number for llama v2; what does llama v2 do to make it stable while v1 didn't?
    # warmup_ratio=warmup_ratio,  # copying alpaca for now, number of steps for a linear warmup, TODO once real training starts change?
    # weight_decay=0.01,  # TODO once real training change?
    weight_decay=weight_decay,  # TODO once real training change?
    learning_rate=learning_rate,  # TODO once real training change? anything larger than 1e-3 has given me terrible results
    max_grad_norm=1.0,  # TODO once real training change?
    # lr_scheduler_type=lr_scheduler_type,  # TODO once real training change? using what I've seen most in vision
    # lr_scheduler_kwargs=lr_scheduler_kwargs,  # ref: https://huggingface.co/docs/transformers/v4.37.0/en/main_classes/optimizer_schedules#transformers.SchedulerType
    logging_dir=Path('~/data/maf/logs').expanduser(),
    # save_steps=4000,  # alpaca does 2000, other defaults were 500
    save_steps=max_steps // 3,  # alpaca does 2000, other defaults were 500
    # save_steps=1,
    # logging_steps=250,
    # logging_steps=50,
    logging_first_step=True,
    # logging_steps=3,
    logging_steps=1,
    remove_unused_columns=False,  # TODO: don't fully get why this is needed, see https://stackoverflow.com/questions/76879872/how-to-use-huggingface-hf-trainer-train-with-custom-collate-function/76929999#76929999 , https://claude.ai/chat/475a4638-cee3-4ce0-af64-c8b8d1dc0d90
    report_to=report_to,  # change to wandb!
    fp16=False,  # never ever set to True
    bf16=torch.cuda.get_device_capability(torch.cuda.current_device())[0] >= 8,  # bfloat16 is available on compute capability >= 8 (Ampere or newer); otherwise this stays False and training runs in fp32
)
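# not in the original snippet: if you flip report_to to 'wandb' above, you can also pre-create
# the run with a custom name before trainer.train() (assumes the wandb package is installed and
# you are logged in; 'my_project' is a placeholder):
if report_to == 'wandb':
    import wandb
    wandb.init(project='my_project', name=f'run_{today}')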
print(f'{pretrained_model_name_or_path=}\n{optim=}\n{learning_rate=}')
# TODO: might be nice to figure out how llama v2 counts the number of tokens they've trained on
print(f'{train_dataset=}')
# print(f'{eval_dataset=}')
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, lr_scheduler),
)
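# not in the original snippet: as far as I understand, the optimizers=(optimizer, lr_scheduler)
# tuple takes precedence over optim / lr_scheduler_type in TrainingArguments; quick sanity check
# that the Trainer picked them up:
print(f'{(trainer.optimizer is optimizer)=} {(trainer.lr_scheduler is lr_scheduler)=}')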
# - Train
cuda_visible_devices = os.environ.get('CUDA_VISIBLE_DEVICES')
if cuda_visible_devices is not None:
    print(f"CUDA_VISIBLE_DEVICES = {cuda_visible_devices}")
trainer.train()
trainer.save_model(output_dir=output_dir) # TODO is this really needed? https://discuss.huggingface.co/t/do-we-need-to-explicity-save-the-model-if-the-save-steps-is-not-a-multiple-of-the-num-steps-with-hf/56745
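Re the "max steps manually decided depending on how many tokens we want to train on" comment at the top, a rough way to derive it (a sketch; target_train_tokens, world_size, and max_seq_length are placeholder names, not variables from the snippet above):

target_train_tokens = 1_000_000_000  # placeholder token budget
world_size = 1  # number of GPUs / processes
max_seq_length = 4096  # placeholder: sequence length the batches are packed/padded to
tokens_per_step = per_device_train_batch_size * gradient_accumulation_steps * world_size * max_seq_length
max_steps = target_train_tokens // tokens_per_step
print(f'{max_steps=}')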
Related: Using Cosine LR scheduler via TrainingArguments in Trainer - #8 by brando
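And if you want to eyeball the warmup/cosine shape without consuming steps of the real scheduler, a throwaway probe like this works (a sketch on a dummy optimizer, not part of the training code above):

probe_optimizer = AdamW([torch.nn.Parameter(torch.zeros(1))], lr=learning_rate)
probe_scheduler = get_cosine_schedule_with_warmup(
    probe_optimizer,
    num_warmup_steps=int(max_steps * warmup_ratio),
    num_training_steps=max_steps,
)
probe_lrs = []
for _ in range(max_steps):
    probe_lrs.append(probe_scheduler.get_last_lr()[0])
    probe_optimizer.step()  # step the dummy optimizer first to avoid the scheduler-order warning
    probe_scheduler.step()
print(f'{probe_lrs[:5]=} {probe_lrs[-5:]=}')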