Bug in gradient accumulation in the Hugging Face Trainer's training_step?


Hello, when I use the Hugging Face Trainer with gradient accumulation, the loss is significantly bigger than when I don't use it, so I checked the code. There is something confusing at line 3604 of the Trainer class, in the training_step function: they multiply the loss by gradient_accumulation_steps, and then on return they just divide it by the same gradient_accumulation_steps, so what is the purpose of this after all? (My transformers version is 4.46.1.)
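To make the confusion concrete, here is a toy check in plain PyTorch (not the Trainer's code, just the pattern I described above): the value that is divided back and returned for logging is unchanged by the multiply-then-divide, but the backward() call in between sees the scaled loss, so it is not a pure no-op.

```python
import torch

N = 4  # stand-in for gradient_accumulation_steps
x = torch.tensor(3.0, requires_grad=True)

loss = x ** 2                    # toy loss
scaled = loss * N                # what backward() actually receives
scaled.backward()                # gradient here is N times larger
reported = scaled.detach() / N   # divided back before being returned for logging

print(reported.item())  # 9.0 -> same value as the unscaled loss
print(x.grad.item())    # 24.0 instead of 6.0 -> backward saw the scaled loss
```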


The GitHub version has been slightly changed…
I don’t know if it’s fixed or not.

I don't think the part in question has changed, though…
I wonder what is going on…

When I install the dev version directly from GitHub, the loss is no longer multiplied by the number of gradient accumulation steps, so it seems like some dev messed it up in the 4.46.1 release :sweat_smile: . But do you have any idea why, when a custom loss function is provided, we need to multiply the loss by gradient_accumulation_steps? Moreover, it's crazy that when I update to the 4.47.0.dev0 version the training time drops from 14 hours to 5 hours; I really don't know what is going on here :sweat_smile:
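For comparison, the usual gradient accumulation pattern in plain PyTorch divides the loss by the number of accumulation steps exactly once before backward(), so my guess is that the extra multiplication in the Trainer is there to cancel a division applied somewhere else (e.g. inside accelerator.backward), but I'm not sure. This is just a generic sketch, not the Trainer's implementation:

```python
import torch

def train_with_accumulation(model, optimizer, dataloader, accumulation_steps=4):
    """Generic gradient accumulation loop (not the HF Trainer internals)."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps  # scale once, before backward
        loss.backward()                           # gradients add up in .grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```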


No, I have no idea! :laughing:
I've never used the Trainer myself, but that's beside the point; it just means this is probably a tough issue.

I guess they are trying to fix it…
The recent commits, for example, are attempts at exactly that.
But it's probably not completely fixed yet. They might still be working on it.
Even today, small bugs turn up here and there in various libraries and are routinely fixed. If information doesn't flow smoothly from the user side to the developer side, bugs won't be noticed and fixed…

But I don't think there's a GitHub issue for this question yet…
I've only had a GitHub account for a week, so I can't say for sure…

https://github.com/huggingface/transformers/pulls
