I am trying to fine-tune Microsoft's phi-2 on the WikiText2 dataset, with L1 regularization applied to the model's weights.
But during training, after some steps the loss would suddenly explode and the gradients became inf along with it, so the model couldn't train any further.
I noticed that I had used gradient_accumulation_steps=4 for faster training. After turning it down to 2, the loss hasn't exploded so far.
So my question is: does gradient accumulation also accumulate the loss?
As you know, gradient accumulation reduces the number of weight updates by accumulating gradients over several forward and backward passes, effectively simulating a larger batch size. Since you initially set gradient_accumulation_steps=4, the model weights would only be updated once for every 4 forward and backward passes.
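For concreteness, here is a minimal sketch of what that looks like in plain PyTorch (the model, data, and hyperparameters are placeholders for illustration, not taken from your setup):

```python
# Minimal sketch of gradient accumulation in plain PyTorch.
# The model, data, and hyperparameters below are illustrative stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)               # stand-in for phi-2
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
accumulation_steps = 4                 # corresponds to gradient_accumulation_steps

for step in range(16):
    x = torch.randn(8, 10)             # synthetic micro-batch
    y = torch.randn(8, 1)

    loss = loss_fn(model(x), y)        # loss is computed fresh on every forward pass
    loss.backward()                    # gradients are ADDED into the .grad buffers

    if (step + 1) % accumulation_steps == 0:
        # Only here are the accumulated gradients applied and then cleared.
        # Some trainers divide the loss by accumulation_steps to average instead of sum.
        optimizer.step()
        optimizer.zero_grad()
```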
Setting a large value for gradient_accumulation_steps can thus lead to instability. Because the gradients applied at each update are the accumulated gradients over n steps, they are larger, and occasionally large enough that the model parameters overshoot. Think of the classical issue of choosing the right learning rate for gradient descent: too large a step jumps past the minimum, and the iterates can diverge instead of converging. A large gradient_accumulation_steps value can cause overshooting in much the same way.
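A toy illustration of that overshoot effect (not from your setup, just gradient descent on f(x) = x²):

```python
# Gradient descent on f(x) = x**2: a well-chosen step size converges toward 0,
# while a too-large step size makes the update overshoot and diverge.
def gradient_descent(lr, steps=5, x=1.0):
    for _ in range(steps):
        grad = 2 * x          # f'(x) = 2x
        x = x - lr * grad
    return x

print(gradient_descent(lr=0.1))   # shrinks toward 0
print(gradient_descent(lr=1.1))   # |x| grows every step: the update overshoots
```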
There's also the fact that you are using L1 regularization, which can exacerbate the instability of the gradient updates, since the penalty term adds an extra component (proportional to the sign of each weight) to every gradient.
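As a rough sketch of what I mean, here is one common way an L1 penalty gets added to the loss in PyTorch (the lambda value and names are assumptions, not taken from your code):

```python
# Hedged sketch: adding an L1 penalty on the weights to the task loss.
# After backward(), every weight's gradient includes an extra l1_lambda * sign(w) term.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
l1_lambda = 1e-4                       # illustrative value

x, y = torch.randn(8, 10), torch.randn(8, 1)
task_loss = loss_fn(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = task_loss + l1_lambda * l1_penalty
loss.backward()                        # gradients now carry the L1 term as well
```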
Returning to your question: no, gradient accumulation does not accumulate the loss. The loss is computed on each forward pass, and the gradients of that loss are computed during backpropagation; only the gradients are accumulated. The issue you are facing is more likely that accumulating gradients over a larger number of steps produces large updates that destabilize training and can make the loss explode, as you've seen.
Wow, yeah, I think that makes more sense. It's not that the loss is being accumulated; it must be that the gradient update is so large that it in turn makes the loss go up.