I want to update a model by optimizing two losses alternately with gradient accumulation, like this:
# Suppose gradient_accumulation_steps is set to 2.
optimizer = optim(unet.parameters())

# First update: loss1
with accelerator.accumulate(unet):
    outputs = unet(input)
    loss1 = loss_func1(outputs)
    loss1.backward()
    optimizer.step()
    optimizer.zero_grad()

# Second update: loss2
with accelerator.accumulate(unet):
    outputs = unet(input)
    loss2 = loss_func2(outputs)
    loss2.backward()
    optimizer.step()
    optimizer.zero_grad()
Is this correct? From the documentation, it appears that under accelerator.accumulate the loss is normalized and backpropagated without an optimizer update until gradient_accumulation_steps is reached. My main concern is that the gradients accumulated from the two different losses for the same model will interfere with each other.
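For reference, my understanding of the usual single-loss accumulation pattern is roughly the following (a sketch with placeholder names; the dataloader and loss function are just stand-ins):

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=2)
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(unet):
        outputs = unet(batch)
        loss = loss_func1(outputs)
        accelerator.backward(loss)  # scales the loss by 1/gradient_accumulation_steps
        optimizer.step()            # as I understand it, only steps on the accumulation boundary
        optimizer.zero_grad()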
Hope to find some help here, thanks in advance.
According to this thread, that seems like it would work fine: Multiple gradient updates with two separate losses and two classifiers sharing the same encoder - nlp - PyTorch Forums. During accumulation the two losses are backpropagated and referenced separately, I believe, which shouldn't cause them to entangle.
The key here is not to call zero_grad() in between if possible. So I think we can essentially rewrite this as:
# Suppose gradient_accumulation_steps is set to 2.
optimizer = optim(unet.parameters())

with accelerator.accumulate(unet):
    outputs = unet(input)
    loss1 = loss_func1(outputs)
    loss1.backward(retain_graph=True)  # keep the graph so loss2 can backprop through the same outputs
    loss2 = loss_func2(outputs)
    loss2.backward()
    optimizer.step()
    optimizer.zero_grad()
And this should work and be okay; the key is setting retain_graph=True on the first backward call.
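If the goal is simply to accumulate the gradients of both losses into one update, an equivalent sketch that avoids retain_graph would be to sum the losses first (assuming summing is what you want, since the gradient of a sum equals the sum of the gradients):

with accelerator.accumulate(unet):
    outputs = unet(input)
    total_loss = loss_func1(outputs) + loss_func2(outputs)
    total_loss.backward()   # same accumulated gradients as backpropagating each loss separately
    optimizer.step()
    optimizer.zero_grad()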
Hi, thank you for the reply!
I'm sorry I didn't make my question clear. Optimizing the two losses alternately (sequentially) means loss2 is computed with respect to the model that has already been updated by optimizing loss1. I want to use accelerator.accumulate to do gradient accumulation with the pseudocode in Multiple gradient updates with two separate losses and two classifiers sharing the same encoder - #2 by srishti-git1110 - nlp - PyTorch Forums:
import itertools
import torch

params = [encoder.parameters(), fc1.parameters(), fc2.parameters()]
optimizer = torch.optim.Adam(itertools.chain(*params), lr=0.01)

# Calling optimizer.step() twice per batch is exactly what I want.
for batch_idx, batch in enumerate(dataloader_instance):
    # calculate lcce and lwd

    # first update: lcce
    lcce.backward()
    optimizer.step()
    optimizer.zero_grad()

    # second update: lwd (with the encoder gradients rescaled)
    lwd = -1 * lwd
    lwd.backward()
    for param in encoder.parameters():
        param.grad = -1 * beta * param.grad
    optimizer.step()
    optimizer.zero_grad()
However, your code above updates the model once, using the accumulated gradients of loss1 + loss2, rather than performing two separate updates.
By the way, is it necessary to set retain_graph=True to keep the compute graph from the loss1 backward pass? torch.Tensor.backward accumulates gradients in the leaves by default.
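Concretely, what I'm asking is whether the second backward here still needs the graph that the first backward would otherwise free (a minimal sketch of what I mean):

outputs = unet(input)
loss1 = loss_func1(outputs)
loss1.backward()      # without retain_graph=True, the graph through `outputs` is freed here
loss2 = loss_func2(outputs)
loss2.backward()      # does this second backward still work without the retained graph?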
My advice relies on the solution posted just below that one. Whether it truly works, I don't know, as I'm learning this along with you in this fairly niche question. It's perhaps also worth asking in that forum discussion as well.