Gradient accumulation loss compute

DaleMeng · June 4, 2024, 2:28am

Hello, everyone!
Suppose the input has shape [b, s, dim]. I recently noticed that CrossEntropyLoss is (1) averaged over all tokens (b * s) in the batch, rather than (2) averaged within each sentence first and then averaged across sentences.
Here is the code that computes the loss in Hugging Face transformers' BertLMHeadModel:

        sequence_output = outputs[0]
        prediction_scores = self.cls(sequence_output)
        lm_loss = None
        if labels is not None:
            # we are doing next-token prediction; shift prediction scores and input ids by one
            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
            labels = labels[:, 1:].contiguous()
            loss_fct = CrossEntropyLoss()
            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

I know that (1) and (2) give the same result in this situation, since every sequence in the batch has the same length s. But when we apply gradient accumulation, I think the situation is different.

Suppose the batch size is 4 and the four sentences have lengths 100, 200, 300, and 400.
With batch_size = 4, the loss is the average over all 1000 tokens.

But with batch_size = 1 and gradient accumulation = 4, I think the loss is different. We first compute the loss on each sentence separately and then average the results. For the 100-token sentence, we average the loss over its 100 tokens, divide by 4, and add it to the total loss; the same goes for the other three sentences. Each sentence then contributes a quarter of the total regardless of its length, so I think the loss computed this way differs from the one computed with batch_size = 4.
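
To make the arithmetic concrete, here is a minimal sketch in plain Python. The per-token loss values are made up for illustration (each sentence is assumed to have a constant per-token loss), and the names `per_token_loss`, `loss_sums`, and `token_weighted_loss` are hypothetical, not taken from transformers. It compares the full-batch average, the per-step average used with accumulation as described above, and a token-weighted accumulation that should match the full batch.

```python
# Sentence lengths from the example above.
lengths = [100, 200, 300, 400]
# Assumed (made-up) per-token loss, constant within each sentence.
per_token_loss = [1.0, 2.0, 3.0, 4.0]

# Summed loss of each sentence and the total token count.
loss_sums = [n * l for n, l in zip(lengths, per_token_loss)]
total_tokens = sum(lengths)

# batch_size = 4: one mean over all 1000 tokens.
full_batch_loss = sum(loss_sums) / total_tokens

# batch_size = 1, accumulation = 4: mean per sentence, then divide by the step count.
accum_steps = len(lengths)
accumulated_loss = sum((s / n) / accum_steps for s, n in zip(loss_sums, lengths))

# Token-weighted accumulation: each step contributes its summed loss
# divided by the token count of the whole accumulated batch.
token_weighted_loss = sum(s / total_tokens for s in loss_sums)

print("full batch:          ", full_batch_loss)
print("per-step mean accum.:", accumulated_loss)
print("token-weighted accum:", token_weighted_loss)
```

With these made-up numbers the first two values differ, because the per-step mean gives the 100-token sentence the same weight as the 400-token one, while the token-weighted variant agrees with the full batch up to floating-point rounding.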

Did I misunderstand something?
