[Question] How to optimize two losses alternately with gradient accumulation?

I want to update a model by optimizing two losses alternately with gradient accumulation, like this:

# Suppose gradient_accumulation_steps is set to 2.
optimizer = optim(unet.parameters())
with accelerator.accumulate(unet):
    outputs = unet(input)
    loss1 = loss_func1(outputs)
    loss1.backward()
    optimizer.step() 
    optimizer.zero_grad()

with accelerator.accumulate(unet):
    outputs = unet(input)
    loss2 = loss_func2(outputs)
    loss2.backward()
    optimizer.step()
    optimizer.zero_grad()

Is this correct? From the documentation, it appears that accelerator.accumulate normalizes the loss and backpropagates without applying the optimizer update until gradient_accumulation_steps is reached. My main concern is that the gradients accumulated from the two different losses for the same model will interfere with each other.
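
For reference, the single-loss accumulation pattern from the Accelerate docs looks roughly like this (my paraphrase, reusing the names from my snippet above):

for batch in dataloader:
    with accelerator.accumulate(unet):
        outputs = unet(input)
        loss = loss_func(outputs)
        accelerator.backward(loss)  # Accelerate scales the loss for accumulation here
        optimizer.step()            # the prepared optimizer only really steps on sync steps
        optimizer.zero_grad()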

Hope to find some help here, thanks in advance.


According to this thread, it seems like it would work fine: Multiple gradient updates with two separate losses and two classifiers sharing the same encoder - nlp - PyTorch Forums. During accumulation, the gradients would be attached/referenced separately, I believe, which shouldn’t cause them to entangle.

The key here is to avoid calling zero_grad() between the two backward passes if possible. So I think we can essentially rewrite this as:

# Suppose gradient_accumulation_steps is set to 2.
optimizer = optim(unet.parameters())
with accelerator.accumulate(unet):
    outputs = unet(input)
    loss1 = loss_func1(outputs)
    loss1.backward(retain_graph=True)  # keep the graph alive so loss2 can backprop through outputs
    loss2 = loss_func2(outputs)
    loss2.backward()                   # gradients from both losses accumulate in the leaves
    optimizer.step()
    optimizer.zero_grad()

And this should work / be okay; the key is ensuring retain_graph=True on the first backward call, since both losses backpropagate through the same forward pass.
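
As a side note, here is a minimal sketch (with toy tensors rather than the unet above) of why the first backward call needs retain_graph=True when both losses share one forward pass:

import torch

x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)
out = x @ w                        # a single forward pass shared by both losses
loss1 = out.sum()
loss2 = (out ** 2).sum()

loss1.backward(retain_graph=True)  # without retain_graph, the next call would raise
loss2.backward()                   # "Trying to backward through the graph a second time"
# w.grad now holds d(loss1)/dw + d(loss2)/dw, the same as (loss1 + loss2).backward()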


Hi, thank you for the reply!

I’m sorry I didn’t make my question clear. Optimizing two losses alternately (sequentially) means loss2 is computed w.r.t. the model already updated by optimizing loss1. I want to use accelerator.accumulate to do gradient accumulation following the pseudo code in Multiple gradient updates with two separate losses and two classifiers sharing the same encoder - #2 by srishti-git1110 - nlp - PyTorch Forums:

import itertools
import torch

# encoder, fc1, fc2, beta, lcce, lwd, and dataloader_instance come from the linked thread.
params = [encoder.parameters(), fc1.parameters(), fc2.parameters()]
optimizer = torch.optim.Adam(itertools.chain(*params), lr=0.01)

# Calling optimizer.step() twice per batch is exactly what I want.
for batch_idx, batch in enumerate(dataloader_instance):
    # calculate lcce and lwd from the batch
    lcce.backward()
    optimizer.step()
    optimizer.zero_grad()

    lwd = -1 * lwd
    lwd.backward()

    # rescale the encoder's gradients before the second update
    for param in encoder.parameters():
        param.grad = -1 * beta * param.grad
    optimizer.step()
    optimizer.zero_grad()

However, your current code updates the model once using the accumulated gradients of loss1 and loss2 combined, rather than performing two separate updates.
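
To make the intent concrete, here is a plain-PyTorch sketch of the alternating updates I’m after, with manual gradient accumulation (accum_steps and get_microbatches are illustrative placeholders, not real API):

# Phase 1: accumulate gradients of loss1 over accum_steps micro-batches, then update.
accum_steps = 2
optimizer.zero_grad()
for micro_batch in get_microbatches(accum_steps):
    loss1 = loss_func1(unet(micro_batch)) / accum_steps  # normalize as accumulation does
    loss1.backward()
optimizer.step()

# Phase 2: the same for loss2, now computed w.r.t. the model just updated by loss1.
optimizer.zero_grad()
for micro_batch in get_microbatches(accum_steps):
    loss2 = loss_func2(unet(micro_batch)) / accum_steps
    loss2.backward()
optimizer.step()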

By the way, is it necessary to set retain_graph=True to keep the compute graph for loss1? torch.Tensor.backward already accumulates gradients in the leaves by default.
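
(A quick toy check of what I mean: with two separate forward passes, no retain_graph is needed, yet the gradients still accumulate.)

import torch

x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)

# Two separate forward passes build two graphs: no retain_graph needed,
# and the gradients of both losses accumulate into w.grad.
(x @ w).sum().backward()
((x @ w) ** 2).sum().backward()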

My advice there relies on the solution posted just below that one. Whether it truly holds, I don’t know, as I’m learning this like you in this fairly niche question. It’s perhaps also worth asking in that forum discussion as well.
