Why is grad norm clipping done during training by default?

I know that gradient clipping is useful for preventing exploding gradients. Is this the reason why it is enabled by default? Or does it also improve overall training quality?

Why is norm clipping used instead of the alternatives?

It usually improves training (and is done in virtually every fine-tuning script from research papers), which is why we enable it by default. Norm clipping is the most commonly used variant; you can always try the alternatives and see if they yield better results.
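As a minimal sketch: with the `transformers` Trainer, norm clipping is controlled by the `max_grad_norm` argument of `TrainingArguments`, while an alternative such as value clipping would have to be applied manually, for example with `torch.nn.utils.clip_grad_value_` in a custom training loop. The threshold values below are illustrative, not recommendations.

```python
import torch
from transformers import TrainingArguments

# Norm clipping through the Trainer: clip the total gradient norm to 1.0
# (1.0 is also the default value of max_grad_norm).
args = TrainingArguments(
    output_dir="out",
    max_grad_norm=1.0,
)

# Alternative sketch: value clipping in a plain PyTorch training step,
# clipping each gradient element to [-0.5, 0.5] instead of the norm.
def training_step(model, batch, optimizer):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
    optimizer.step()
    optimizer.zero_grad()
```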


How do I change the gradient clipping type using TrainingArguments? @sgugger