I want to update a model by alternately optimizing two losses with gradient accumulation, like this:

```
# Suppose gradient_accumulation is set to 2.
optimizer = optim(unet.parameters())

with accelerator.accumulate(unet):
    outputs = unet(input)
    loss1 = loss_func1(outputs)
    loss1.backward()
    optimizer.step()
    optimizer.zero_grad()

with accelerator.accumulate(unet):
    outputs = unet(input)
    loss2 = loss_func2(outputs)
    loss2.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Is this correct? From the documentation, it appears that `accelerator.accumulate` normalizes the loss and backpropagates without actually updating the model until `gradient_accumulation_steps` is reached. My main concern is that the gradients accumulated from the two different losses for the same model will affect each other.
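
For reference, the single-loss pattern from the Accelerate docs that I'm adapting looks roughly like this (just a sketch; `model`, `dataloader`, and `loss_func` are placeholders):

```
# Single-loss gradient accumulation, roughly as shown in the Accelerate docs.
for batch in dataloader:
    with accelerator.accumulate(model):
        outputs = model(batch)
        loss = loss_func(outputs)
        accelerator.backward(loss)  # scales the loss according to the accumulation steps
        optimizer.step()
        optimizer.zero_grad()
```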

Hope to find some help here, thanks in advance.


According to here, that seems like it would work fine: Multiple gradient updates with two separate losses and two classifiers sharing the same encoder - nlp - PyTorch Forums. During accumulation the gradients would be attached separately / referenced separately, I believe, which wouldn't cause them to entangle.

The key here is *don't* call `zero_grad()` in between, if possible. So I think we can essentially rewrite this as:

```
# Suppose gradient_accumulation is set to 2.
optimizer = optim(unet.parameters())

with accelerator.accumulate(unet):
    outputs = unet(input)
    loss1 = loss_func1(outputs)
    loss1.backward(retain_graph=True)
    loss2 = loss_func2(outputs)
    loss2.backward()
    optimizer.step()
    optimizer.zero_grad()
```

And this should work / be okay; the key is ensuring `retain_graph=True` on the first `backward()` call.
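
A tiny standalone sketch of why that flag matters here (the tensors are just illustrative): both losses hang off the same forward pass, and the first `backward()` frees the shared graph unless it is retained.

```
import torch

x = torch.randn(4, 3, requires_grad=True)
y = (x * 2).sum()   # shared part of the graph, like `outputs` above

loss1 = y ** 2
loss2 = -y

loss1.backward(retain_graph=True)  # keeps the shared graph alive
loss2.backward()                   # without retain_graph above, this raises
                                   # "Trying to backward through the graph a second time"
```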


Hi, thank you for the reply!

I'm sorry I didn't make my question clear. Optimizing two losses alternately (sequentially) means `loss2` is computed w.r.t. the model already updated by optimizing `loss1`. I want to use `accelerator.accumulate` to do gradient accumulation with the pseudo code in Multiple gradient updates with two separate losses and two classifiers sharing the same encoder - #2 by srishti-git1110 - nlp - PyTorch Forums.

```
import itertools
import torch

params = [encoder.parameters(), fc1.parameters(), fc2.parameters()]
optimizer = torch.optim.Adam(itertools.chain(*params), lr=0.01)

# optimizer.step() twice per batch is what I exactly want.
for batch_idx, batch in enumerate(dataloader_instance):
    # calculate lcce and lwd

    # First update: classification loss.
    lcce.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Second update: negated loss, with the encoder gradients rescaled.
    lwd = -1 * lwd
    lwd.backward()
    for param in encoder.parameters():
        param.grad = -1 * beta * param.grad
    optimizer.step()
    optimizer.zero_grad()
```

However, your current code updates the model with the accumulated gradients of the sum of `loss1` and `loss2`, not with two separate updates.

By the way, is it necessary to set `retain_graph=True` to keep the compute graph w.r.t. `loss1`? `torch.Tensor.backward` accumulates gradients in the leaves by default.
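
For context, my understanding is that `retain_graph=True` only matters when both losses share one forward pass; in my alternating scheme each loss gets its own forward pass, so each `backward()` walks its own graph. A toy sketch of what I mean (the model and losses are just illustrative):

```
import torch

model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 3)

# First loss: its own forward pass and graph.
loss1 = model(x).pow(2).mean()
loss1.backward()   # its graph is freed here, which is fine
opt.step()
opt.zero_grad()

# Second loss: a fresh forward pass w.r.t. the already-updated model,
# so no retain_graph is needed for this backward.
loss2 = -model(x).mean()
loss2.backward()
opt.step()
opt.zero_grad()
```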

My advice there relies on the solution posted just below that one. Whether it truly does what you need, I don't know, as I'm learning this like you are in this fairly niche area. Perhaps it's also worth asking in that forum discussion as well.
