I’m reading the source code of the Trainer class and have a question about the training_step function. In that function, the backward pass seems to happen before the loss is returned: after computing the loss, the code calls scaled_loss.backward() (or self.accelerator.backward(loss)) and then returns loss.detach() / self.args.gradient_accumulation_steps.
Does this mean the function first backpropagates the computed loss and only then divides the detached loss by self.args.gradient_accumulation_steps before returning it? How does this fit into the gradient accumulation strategy?
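For context, here is a minimal sketch of the gradient-accumulation pattern as I currently understand it. The toy model, optimizer, and data below are my own placeholders for illustration, not the actual Trainer code:

```python
import torch
import torch.nn as nn

# Toy setup so the snippet runs on its own; the real Trainer wraps this logic
# inside training_step and its outer training loop.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accumulation_steps = 4  # analogous to args.gradient_accumulation_steps

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = loss_fn(model(inputs), targets)

    # Scale the loss so the accumulated gradients match the magnitude of a
    # single update over the full effective batch, then backpropagate.
    (loss / accumulation_steps).backward()

    # The optimizer only steps once every `accumulation_steps` micro-batches;
    # in between, gradients simply add up in the parameters' .grad buffers.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Is this roughly what training_step is doing per micro-batch, with the optimizer step handled elsewhere in the training loop?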
The function is quite short, so I hope you can take a quick look.
Thanks for your help!
For reference: link to the source code of the training_step function.