I am trying to fine-tune Microsoft's phi-2 on the WikiText2 dataset, with L1 regularization applied to the model's weights.
But during training, after some steps the loss would suddenly explode and the gradients became inf along with it, so the model couldn't train any further.
I noticed that I had used gradient_accumulation_steps=4 for faster training. After turning it down to 2, the loss hasn't exploded so far.
So my question is: does gradient accumulation also accumulate the loss?
As you know, gradient accumulation reduces the number of weight updates by accumulating gradients over several forward and backward passes, effectively simulating a larger batch size. Since you initially set gradient_accumulation_steps=4, the model weights would only be updated once for every 4 forward and backward passes.
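For concreteness, here is a minimal sketch of what that looks like in plain PyTorch (the model, data, and hyperparameters are placeholders for illustration, not taken from your setup):

```python
# Minimal sketch of gradient accumulation in plain PyTorch.
# The model, data, and hyperparameters below are illustrative stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)               # stand-in for phi-2
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
accumulation_steps = 4                 # corresponds to gradient_accumulation_steps

for step in range(16):
    x = torch.randn(8, 10)             # synthetic micro-batch
    y = torch.randn(8, 1)

    loss = loss_fn(model(x), y)        # loss is computed fresh on every forward pass
    loss.backward()                    # gradients are ADDED into the .grad buffers

    if (step + 1) % accumulation_steps == 0:
        # Only here are the accumulated gradients applied and then cleared.
        # Some trainers divide the loss by accumulation_steps to average instead of sum.
        optimizer.step()
        optimizer.zero_grad()
```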
Setting a large value for gradient_accumulation_steps can thus lead to instability. Because the gradients applied at each update are the accumulated gradients over n steps, they are larger, and occasionally large enough that the model parameters overshoot. Think of the classical issue of choosing the right learning rate for gradient descent: too large a step jumps past the minimum, and the iterates can diverge instead of converging. A large gradient_accumulation_steps value can cause overshooting in much the same way.
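A toy illustration of that overshoot effect (not from your setup, just gradient descent on f(x) = x²):

```python
# Gradient descent on f(x) = x**2: a well-chosen step size converges toward 0,
# while a too-large step size makes the update overshoot and diverge.
def gradient_descent(lr, steps=5, x=1.0):
    for _ in range(steps):
        grad = 2 * x          # f'(x) = 2x
        x = x - lr * grad
    return x

print(gradient_descent(lr=0.1))   # shrinks toward 0
print(gradient_descent(lr=1.1))   # |x| grows every step: the update overshoots
```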
There's also the fact that you are using L1 regularization, which can exacerbate the instability of the gradient updates, since the penalty term adds an extra component (proportional to the sign of each weight) to every gradient.
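As a rough sketch of what I mean, here is one common way an L1 penalty gets added to the loss in PyTorch (the lambda value and names are assumptions, not taken from your code):

```python
# Hedged sketch: adding an L1 penalty on the weights to the task loss.
# After backward(), every weight's gradient includes an extra l1_lambda * sign(w) term.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
l1_lambda = 1e-4                       # illustrative value

x, y = torch.randn(8, 10), torch.randn(8, 1)
task_loss = loss_fn(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = task_loss + l1_lambda * l1_penalty
loss.backward()                        # gradients now carry the L1 term as well
```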
Returning to your question: no, gradient accumulation does not accumulate the loss. The loss is computed on each forward pass, and the gradients of that loss are computed during backpropagation; only the gradients are accumulated. The issue you are facing is more likely that accumulating gradients over a larger number of steps produces large updates that destabilize training and can make the loss explode, as you've seen.
Wow, yeah, I think that makes more sense. It's not that the loss is being accumulated; it must be that the gradient update is so large that it in turn makes the loss go up.