Hi, I am training a diffusion model and I am facing a weird issue where gradient tracking is not working.
If I run this code segment:
```Python
noise_pred = self.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    cross_attention_kwargs=cross_attention_kwargs,
    return_dict=False,
)[0]
```
the output of noise_pred.requires_grad is True. But if I run this instead:
```Python
with torch.no_grad():
    noise_pred = self.unet(
        latent_model_input,
        t,
        encoder_hidden_states=prompt_embeds,
        cross_attention_kwargs=cross_attention_kwargs,
        return_dict=False,
    )[0]

noise_pred = self.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    cross_attention_kwargs=cross_attention_kwargs,
    return_dict=False,
)[0]
print(noise_pred.requires_grad)
```
I get the result False. It seems that running the UNet once under the torch.no_grad context somehow turns off gradient tracking for the later call as well. I checked requires_grad for the parameters of the UNet and it was True.
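For comparison, here is what I would expect with a plain module and no other context managers (a minimal sketch with a stand-in linear layer, not the actual UNet):

```Python
import torch
import torch.nn as nn

net = nn.Linear(8, 8)    # stand-in for self.unet
x = torch.randn(1, 8)

with torch.no_grad():
    y = net(x)
print(y.requires_grad)   # False, as expected inside no_grad

y = net(x)               # same module, called again outside the block
print(y.requires_grad)   # True: leaving no_grad normally restores tracking
```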
Thanks in advance
It seems that tensors originating from the block under this `with torch.no_grad():` statement may continue to be affected by no_grad. In addition, there appear to be cases where the effect remains due to caching…
https://stackoverflow.com/questions/63785319/pytorch-torch-no-grad-versus-requires-grad-false
I understand that using torch.no_grad has the benefit of saving memory and computation because the network won’t backpropagate gradients to layers before torch.no_grad. So, is the following statement correct: If I have a network and only want to...
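For reference, here is a minimal sketch of the distinction that linked question is about (the module names below are just illustrative): torch.no_grad() suppresses graph construction for everything computed inside the block, while requires_grad_(False) freezes specific parameters but still lets gradients flow through them to earlier layers.

```Python
import torch
import torch.nn as nn

backbone = nn.Linear(4, 4)   # illustrative two-stage model
head = nn.Linear(4, 2)
x = torch.randn(1, 4)

# no_grad: nothing computed inside the block is tracked at all
with torch.no_grad():
    out = head(backbone(x))
print(out.requires_grad)     # False -> cannot backprop through this output

# requires_grad_(False): the head is frozen, but the graph still
# reaches the backbone, so the backbone can receive gradients
for p in head.parameters():
    p.requires_grad_(False)
out = head(backbone(x))
out.sum().backward()
print(backbone.weight.grad is not None)  # True
print(head.weight.grad is None)          # True: frozen parameters get no grad
```

The next snippet shows the same interaction between no_grad and autocast with a small Conv2d: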
```Python
import torch
from torch.cuda.amp import autocast

net = torch.nn.Conv2d(3, 3, 3, 3).to('cuda')
input = torch.rand([3, 3, 5, 5], device='cuda')
with autocast():
    with torch.no_grad():
        y = net(input)
    z = net(input)
    print('z...
```
opened 03:12PM - 01 Nov 23 UTC · labels: high priority, module: autograd, triaged, actionable, module: amp (automated mixed precision)
> Hmm, I actually think your example is unexpected, can you file a separate bug for it.
_Originally posted by @ezyang in https://github.com/pytorch/pytorch/issues/105211#issuecomment-1787772959_
If I'm in an autocast context and I do a forward pass of a model in no_grad, then do another forward pass outside of that no_grad context but still in the autocast context, arbitrary nodes in the graph will be lost. See example:
```Python
import torch
import torch.nn as nn

l1 = nn.Sequential(
    nn.Linear(2, 2),
).cuda()
l2 = nn.Sequential(
    nn.Linear(2, 2),
    nn.LayerNorm(2),
).cuda()

x = torch.randn(2, 2).cuda()

#################################
# Just linear
################################
with torch.cuda.amp.autocast():
    with torch.no_grad():
        y1 = l1(x)
    y1_2 = l1(x)
    print(y1_2.grad_fn)  # None

# Remove autocast
with torch.no_grad():
    y1 = l1(x)
y1_2 = l1(x)
print(y1_2.grad_fn)  # AddmmBackward

#################################
# Linear -> LayerNorm makes output
# have grad_fn
################################
with torch.cuda.amp.autocast():
    with torch.no_grad():
        y2 = l2(x)
    y2_2 = l2(x)
    print(y2_2.grad_fn)  # LayerNormBackward

with torch.no_grad():
    y2 = l2(x)
y2_2 = l2(x)
print(y2_2.grad_fn)  # Still LayerNormBackward

with torch.cuda.amp.autocast():
    with torch.no_grad():
        y2 = l2(x)
    y2_2 = l2(x)
    print(y2_2.grad_fn)  # LayerNormBackward

y2_2.sum().backward()
print(l2[0].weight.grad)  # None
print(l2[1].weight.grad)  # Not none
```
PyTorch version: 2.1.0
cc @ezyang @gchanan @zou3519 @kadeng @albanD @gqchen @pearu @nikitaved @soulitzer @Lezcano @Varal7 @mcarilli @ptrblck @leslie-fang-intel @jgong5
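Back to the original problem: under autocast, the fp16 copies made from the fp32 weights are cached for the duration of the autocast region, and if that cast happens inside no_grad the cached copy carries no graph, which (as I understand it) is why a later forward pass outside no_grad still comes back without gradients, while ops like LayerNorm that run in fp32 and are not cached still produce a grad_fn. If that is what is happening with the UNet, a possible workaround (just a sketch with a stand-in module, assuming a CUDA device) is to clear the autocast cache after the no_grad block, or to disable the cache for that region:

```Python
import torch
import torch.nn as nn

net = nn.Linear(2, 2).cuda()      # stand-in for the UNet
x = torch.randn(2, 2).cuda()

with torch.cuda.amp.autocast():
    with torch.no_grad():
        y = net(x)                # fp16 weight casts made here carry no graph and get cached
    torch.clear_autocast_cache()  # drop the casts that were made under no_grad
    z = net(x)                    # weights are re-cast with grad enabled
print(z.grad_fn is not None)      # True

# Alternatively, disable the weight cast cache for the whole region:
with torch.cuda.amp.autocast(cache_enabled=False):
    with torch.no_grad():
        y = net(x)
    z = net(x)
print(z.grad_fn is not None)      # True
```

Both approaches trade a bit of speed (the weights get re-cast) for a correctly connected graph.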