The Hugging Face Trainer always lets DeepSpeed handle the scheduler, and DeepSpeed steps the scheduler at every training step, which is wrong for ReduceLROnPlateau. DeepSpeed's documentation says you can manage your scheduler outside DeepSpeed, but how do you do that when using the Hugging Face Trainer?
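One common workaround is to step a plateau-style scheduler only when an evaluation metric arrives, while any per-step scheduler continues to run normally. The sketch below is framework-free (no DeepSpeed or Trainer API); `PlateauTracker` is a toy stand-in for torch's `ReduceLROnPlateau`, not its real implementation, and only illustrates the timing:

```python
# A minimal, framework-free sketch of managing the scheduler yourself:
# step the plateau logic once per *evaluation*, never per training step.
# PlateauTracker is a toy stand-in for torch.optim.lr_scheduler.ReduceLROnPlateau.

class PlateauTracker:
    def __init__(self, lr, factor=0.5, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, metric):
        # Called once per evaluation, with the eval metric.
        if metric < self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                self.lr *= self.factor  # reduce LR after too many bad evals
                self.bad_evals = 0
        return self.lr

tracker = PlateauTracker(lr=0.1, patience=0)
for eval_loss in [1.0, 0.9, 0.95, 0.97]:
    lr = tracker.step(eval_loss)
# lr is halved on each non-improving evaluation: 0.1 -> 0.05 -> 0.025
```

In a real setup the same `step(metric)` call would go wherever your evaluation results become available, e.g. in a `TrainerCallback`'s `on_evaluate` hook.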
Hi @BoltzmachineQ
I think this page might help if you’re using TrainingArguments:
You can also find more information here:
Thanks, but of course I read the documentation before coming here to ask a question… Anyway, I solved it myself…
I’ll read the source code…
PR pie3636:trainer_reducelronplateau → main, opened 03:01 PM, 26 Apr 2023 (UTC)
# What does this PR do?
This PR solves #16503 by adding support for PyTorch's [ReduceLROnPlateau](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html) to `Trainer`.
It does so by adding a new `REDUCE_ON_PLATEAU` field to `SchedulerType` and a new `reduce_lr_on_plateau_args` parameter to `TrainingArguments` that is parsed at initialization to avoid adding 9 new individual arguments. The scheduler re-uses the metric stored in `metric_for_best_model`, and is delayed to run after evaluation since it requires metrics to be populated.
I'm not sure whether it is due to the complexity of `Trainer`, my lack of experience (this is my first PR to a large project) or the uniqueness of `ReduceLROnPlateau` compared to other schedulers, but this PR feels a bit hacky, so I welcome any feedback.
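If the feature lands roughly as described, enabling it from `TrainingArguments` might look like the following sketch. The argument names are assumptions based on the PR text (the new `SchedulerType` field and the reuse of `metric_for_best_model`) and may differ in the merged release:

```python
from transformers import TrainingArguments

# Sketch only: names follow the PR description and may differ after merge.
# The plateau scheduler reuses `metric_for_best_model`, so evaluation must
# run regularly for the scheduler to receive a metric.
args = TrainingArguments(
    output_dir="out",
    lr_scheduler_type="reduce_lr_on_plateau",  # new SchedulerType added by the PR
    metric_for_best_model="eval_loss",         # metric the scheduler monitors
    greater_is_better=False,                   # lower eval_loss is better
    evaluation_strategy="epoch",               # scheduler steps after each eval
)
```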
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the
[documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and
[here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [x] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Looking at #16503, I believe this is for @sgugger.
It seems that with `ReduceLROnPlateau`, the scheduler's `.step()` is only called at evaluation time, not on every training step.
```python
self.optimizer.step()
self.control = self.callback_handler.on_optimizer_step(args, self.state, self.control)

# get learning rate before update
learning_rate = self._get_learning_rate()

if not self.accelerator.optimizer_step_was_skipped:
    # Delay optimizer scheduling until metrics are generated
    if not isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
        self.lr_scheduler.step()

model.zero_grad()
self.state.global_step += 1
self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch
self.control = self.callback_handler.on_step_end(args, self.state, self.control)

self._maybe_log_save_evaluate(
    tr_loss,
    grad_norm,
    model,
```
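Because the per-step call is skipped, something must later call `scheduler.step(metric)` with an evaluation metric, which is why the Trainer defers it to evaluation time. As a standalone illustration of that contract, here is plain PyTorch (not Trainer code):

```python
import torch

# Standalone illustration of the step-with-a-metric contract that the
# Trainer code above defers to evaluation time. With patience=0, a
# single non-improving metric halves the learning rate.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=0
)

scheduler.step(1.0)  # first metric becomes the best so far; lr stays 0.1
scheduler.step(1.0)  # no improvement -> lr is reduced to 0.05
lr = optimizer.param_groups[0]["lr"]
```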