The Hugging Face Trainer always lets DeepSpeed handle the scheduler, and DeepSpeed steps the scheduler at every training step, which is wrong for ReduceLROnPlateau. DeepSpeed's documentation says you can manage your scheduler outside DeepSpeed, but how do you do that when using the Hugging Face Trainer?
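One common workaround is to step a plateau-style scheduler only when an evaluation metric arrives, while any per-step scheduler continues to run normally. The sketch below is framework-free (no DeepSpeed or Trainer API); `PlateauTracker` is a toy stand-in for torch's `ReduceLROnPlateau`, not its real implementation, and only illustrates the timing:

```python
# A minimal, framework-free sketch of managing the scheduler yourself:
# step the plateau logic once per *evaluation*, never per training step.
# PlateauTracker is a toy stand-in for torch.optim.lr_scheduler.ReduceLROnPlateau.

class PlateauTracker:
    def __init__(self, lr, factor=0.5, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, metric):
        # Called once per evaluation, with the eval metric.
        if metric < self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                self.lr *= self.factor  # reduce LR after too many bad evals
                self.bad_evals = 0
        return self.lr

tracker = PlateauTracker(lr=0.1, patience=0)
for eval_loss in [1.0, 0.9, 0.95, 0.97]:
    lr = tracker.step(eval_loss)
# lr is halved on each non-improving evaluation: 0.1 -> 0.05 -> 0.025
```

In a real setup the same `step(metric)` call would go wherever your evaluation results become available, e.g. in a `TrainerCallback`'s `on_evaluate` hook.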
Hi @BoltzmachineQ
I think this page might help if you’re using TrainingArguments:
You can also find more information here:
Thanks, but of course I read the documentation before coming here to ask a question… Anyway, I solved it myself…
I’ll read the source code…
PR pie3636:trainer_reducelronplateau → main, opened 03:01 PM, 26 Apr 2023 (UTC)
# What does this PR do?
This PR solves #16503 by adding support for PyTorch's [ReduceLROnPlateau](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html) to `Trainer`.
It does so by adding a new `REDUCE_ON_PLATEAU` field to `SchedulerType` and a new `reduce_lr_on_plateau_args` parameter to `TrainingArguments` that is parsed at initialization to avoid adding 9 new individual arguments. The scheduler re-uses the metric stored in `metric_for_best_model`, and is delayed to run after evaluation since it requires metrics to be populated.
I'm not sure whether it is due to the complexity of `Trainer`, my lack of experience (this is my first PR to a large project) or the uniqueness of `ReduceLROnPlateau` compared to other schedulers, but this PR feels a bit hacky, so I welcome any feedback.
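If the feature lands roughly as described, enabling it from `TrainingArguments` might look like the following sketch. The argument names are assumptions based on the PR text (the new `SchedulerType` field and the reuse of `metric_for_best_model`) and may differ in the merged release:

```python
from transformers import TrainingArguments

# Sketch only: names follow the PR description and may differ after merge.
# The plateau scheduler reuses `metric_for_best_model`, so evaluation must
# run regularly for the scheduler to receive a metric.
args = TrainingArguments(
    output_dir="out",
    lr_scheduler_type="reduce_lr_on_plateau",  # new SchedulerType added by the PR
    metric_for_best_model="eval_loss",         # metric the scheduler monitors
    greater_is_better=False,                   # lower eval_loss is better
    evaluation_strategy="epoch",               # scheduler steps after each eval
)
```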
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the
[documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and
[here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [x] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Looking at #16503, I believe this is for @sgugger.
It seems that with `ReduceLROnPlateau`, the scheduler's `.step()` is only called at evaluation time, not on every training step.
```python
self.optimizer.step()
self.control = self.callback_handler.on_optimizer_step(args, self.state, self.control)

# get learning rate before update
learning_rate = self._get_learning_rate()

if not self.accelerator.optimizer_step_was_skipped:
    # Delay optimizer scheduling until metrics are generated
    if not isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
        self.lr_scheduler.step()

model.zero_grad()
self.state.global_step += 1
self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch
self.control = self.callback_handler.on_step_end(args, self.state, self.control)

self._maybe_log_save_evaluate(
    tr_loss,
    grad_norm,
    model,
```
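Because the per-step call is skipped, something must later call `scheduler.step(metric)` with an evaluation metric, which is why the Trainer defers it to evaluation time. As a standalone illustration of that contract, here is plain PyTorch (not Trainer code):

```python
import torch

# Standalone illustration of the step-with-a-metric contract that the
# Trainer code above defers to evaluation time. With patience=0, a
# single non-improving metric halves the learning rate.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=0
)

scheduler.step(1.0)  # first metric becomes the best so far; lr stays 0.1
scheduler.step(1.0)  # no improvement -> lr is reduced to 0.05
lr = optimizer.param_groups[0]["lr"]
```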