Why is grad norm clipping done during training by default?

I know that gradient clipping is useful for preventing exploding gradients. Is this the reason why it is enabled by default? Or does it also improve overall training quality?

Why is norm clipping used instead of the alternatives?

It usually improves training (and is done in virtually every fine-tuning script from research papers), which is why we enable it by default. Norm clipping is the most commonly used variant; you can always try the alternatives and see if they yield better results.
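As a minimal sketch: with the `transformers` Trainer, norm clipping is controlled by the `max_grad_norm` argument of `TrainingArguments`, while an alternative such as value clipping would have to be applied manually, for example with `torch.nn.utils.clip_grad_value_` in a custom training loop. The threshold values below are illustrative, not recommendations.

```python
import torch
from transformers import TrainingArguments

# Norm clipping through the Trainer: clip the total gradient norm to 1.0
# (1.0 is also the default value of max_grad_norm).
args = TrainingArguments(
    output_dir="out",
    max_grad_norm=1.0,
)

# Alternative sketch: value clipping in a plain PyTorch training step,
# clipping each gradient element to [-0.5, 0.5] instead of the norm.
def training_step(model, batch, optimizer):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
    optimizer.step()
    optimizer.zero_grad()
```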


How do I change the gradient clipping type using TrainingArguments? @sgugger