Hi, I am training a diffusion model and I am facing a weird issue where gradient tracking is not working.
If I run this code segment:
```Python
noise_pred = self.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    cross_attention_kwargs=cross_attention_kwargs,
    return_dict=False,
)[0]
```
the output of noise_pred.requires_grad is True. But if I run this instead:
```Python
with torch.no_grad():
    noise_pred = self.unet(
        latent_model_input,
        t,
        encoder_hidden_states=prompt_embeds,
        cross_attention_kwargs=cross_attention_kwargs,
        return_dict=False,
    )[0]

noise_pred = self.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    cross_attention_kwargs=cross_attention_kwargs,
    return_dict=False,
)[0]
print(noise_pred.requires_grad)
```
I get the result False. It seems that running the UNet once under the torch.no_grad context somehow turns off gradient tracking for the later call as well. I checked requires_grad for the parameters of the UNet and it was True.
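For comparison, here is what I would expect with a plain module and no other context managers (a minimal sketch with a stand-in linear layer, not the actual UNet):

```Python
import torch
import torch.nn as nn

net = nn.Linear(8, 8)    # stand-in for self.unet
x = torch.randn(1, 8)

with torch.no_grad():
    y = net(x)
print(y.requires_grad)   # False, as expected inside no_grad

y = net(x)               # same module, called again outside the block
print(y.requires_grad)   # True: leaving no_grad normally restores tracking
```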
Thanks in advance
It seems that tensors originating from the block under this `with torch.no_grad():` statement may continue to be affected by no_grad. In addition, there appear to be cases where the effect remains due to caching…
https://stackoverflow.com/questions/63785319/pytorch-torch-no-grad-versus-requires-grad-false
I understand that using torch.no_grad has the benefit of saving memory and computation because the network won’t backpropagate gradients to layers before torch.no_grad. So, is the following statement correct: If I have a network and only want to...
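For reference, here is a minimal sketch of the distinction that linked question is about (the module names below are just illustrative): torch.no_grad() suppresses graph construction for everything computed inside the block, while requires_grad_(False) freezes specific parameters but still lets gradients flow through them to earlier layers.

```Python
import torch
import torch.nn as nn

backbone = nn.Linear(4, 4)   # illustrative two-stage model
head = nn.Linear(4, 2)
x = torch.randn(1, 4)

# no_grad: nothing computed inside the block is tracked at all
with torch.no_grad():
    out = head(backbone(x))
print(out.requires_grad)     # False -> cannot backprop through this output

# requires_grad_(False): the head is frozen, but the graph still
# reaches the backbone, so the backbone can receive gradients
for p in head.parameters():
    p.requires_grad_(False)
out = head(backbone(x))
out.sum().backward()
print(backbone.weight.grad is not None)  # True
print(head.weight.grad is None)          # True: frozen parameters get no grad
```

The next snippet shows the same interaction between no_grad and autocast with a small Conv2d: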
```Python
import torch
from torch.cuda.amp import autocast

net = torch.nn.Conv2d(3, 3, 3, 3).to('cuda')
input = torch.rand([3, 3, 5, 5], device='cuda')
with autocast():
    with torch.no_grad():
        y = net(input)
    z = net(input)
    print('z...
```
opened 03:12PM - 01 Nov 23 UTC · labels: high priority, module: autograd, triaged, actionable, module: amp (automated mixed precision)
> Hmm, I actually think your example is unexpected, can you file a separate bug for it.
_Originally posted by @ezyang in https://github.com/pytorch/pytorch/issues/105211#issuecomment-1787772959_
If I'm in an autocast context and I do a forward pass of a model in no_grad, then do another forward pass outside of that no_grad context but still in the autocast context, arbitrary nodes in the graph will be lost. See example:
```Python
import torch
import torch.nn as nn

l1 = nn.Sequential(
    nn.Linear(2, 2),
).cuda()
l2 = nn.Sequential(
    nn.Linear(2, 2),
    nn.LayerNorm(2),
).cuda()

x = torch.randn(2, 2).cuda()

#################################
# Just linear
################################
with torch.cuda.amp.autocast():
    with torch.no_grad():
        y1 = l1(x)
    y1_2 = l1(x)
    print(y1_2.grad_fn)  # None

# Remove autocast
with torch.no_grad():
    y1 = l1(x)
y1_2 = l1(x)
print(y1_2.grad_fn)  # AddmmBackward

#################################
# Linear -> LayerNorm makes output
# have grad_fn
################################
with torch.cuda.amp.autocast():
    with torch.no_grad():
        y2 = l2(x)
    y2_2 = l2(x)
    print(y2_2.grad_fn)  # LayerNormBackward

with torch.no_grad():
    y2 = l2(x)
y2_2 = l2(x)
print(y2_2.grad_fn)  # Still LayerNormBackward

with torch.cuda.amp.autocast():
    with torch.no_grad():
        y2 = l2(x)
    y2_2 = l2(x)
    print(y2_2.grad_fn)  # LayerNormBackward

y2_2.sum().backward()
print(l2[0].weight.grad)  # None
print(l2[1].weight.grad)  # Not none
```
PyTorch version: 2.1.0
cc @ezyang @gchanan @zou3519 @kadeng @albanD @gqchen @pearu @nikitaved @soulitzer @Lezcano @Varal7 @mcarilli @ptrblck @leslie-fang-intel @jgong5
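Back to the original problem: under autocast, the fp16 copies made from the fp32 weights are cached for the duration of the autocast region, and if that cast happens inside no_grad the cached copy carries no graph, which (as I understand it) is why a later forward pass outside no_grad still comes back without gradients, while ops like LayerNorm that run in fp32 and are not cached still produce a grad_fn. If that is what is happening with the UNet, a possible workaround (just a sketch with a stand-in module, assuming a CUDA device) is to clear the autocast cache after the no_grad block, or to disable the cache for that region:

```Python
import torch
import torch.nn as nn

net = nn.Linear(2, 2).cuda()      # stand-in for the UNet
x = torch.randn(2, 2).cuda()

with torch.cuda.amp.autocast():
    with torch.no_grad():
        y = net(x)                # fp16 weight casts made here carry no graph and get cached
    torch.clear_autocast_cache()  # drop the casts that were made under no_grad
    z = net(x)                    # weights are re-cast with grad enabled
print(z.grad_fn is not None)      # True

# Alternatively, disable the weight cast cache for the whole region:
with torch.cuda.amp.autocast(cache_enabled=False):
    with torch.no_grad():
        y = net(x)
    z = net(x)
print(z.grad_fn is not None)      # True
```

Both approaches trade a bit of speed (the weights get re-cast) for a correctly connected graph.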