loss.backward() problems with requires_grad

“element 0 of tensors does not require grad and does not have a grad_fn”

Is it likely to be a problem with the model, the loss function, or the shape of the data tensors?

I have a dataset of texts, each with an associated real value. I want to fine-tune a BERT model on these, and then visualize the attention weights (for any given text) and how they are altered by the fine-tuning. I have defined a model based on the transformers BertModel, which passes the pooled_output (the [CLS] token representation) through two more dense layers, the first followed by ReLU and the second by Sigmoid.

I’ve been following abhimishra91’s notebook (https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb) and Chris McCormick’s tutorial (https://mccormickml.com/2019/07/22/BERT-fine-tuning/#4-train-our-classification-model), but I’m stuck at the .backward() step.

I’ve tried calculating the loss as part of the forward pass (within the model class definition) and outside of it, but either way I get the same error: “element 0 of tensors does not require grad and does not have a grad_fn”. (A sketch of how I call the model, with the loss computed outside, is included below the model definition.)

Model

import torch
import transformers

class ATBertClass(torch.nn.Module):

    def __init__(self):
        super(ATBertClass, self).__init__()
        # BERT backbone; output_attentions=True so the attention weights are returned
        self.L1bb = transformers.BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
        self.L2Lin = torch.nn.Linear(768, 64)
        self.L3Rel = torch.nn.ReLU()
        self.L4Lin = torch.nn.Linear(64, 1)
        self.L5Sig = torch.nn.Sigmoid()

    def forward(self, input_ids, attention_mask, labels):
        # labels is unused here: in this version the loss is computed outside the model.
        # With output_attentions=True the tuple is (sequence_output, pooled_output, attentions).
        _, output_1, attns = self.L1bb(input_ids=input_ids,
                                       attention_mask=attention_mask)
        output_2 = self.L2Lin(output_1)
        output_3 = self.L3Rel(output_2)
        output_4 = self.L4Lin(output_3)
        output_5 = self.L5Sig(output_4)
        return output_5, attns
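
For context, here is roughly how I call the model and compute the loss outside the forward pass (a sketch only: the loss function, optimizer, and batch field names below are placeholders, not necessarily exactly what I’m using):

# one training step (sketch) -- assumes a DataLoader yielding dicts with
# 'input_ids', 'attention_mask' and 'targets'
model = ATBertClass()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()

model.train()
for batch in training_loader:
    ids = batch['input_ids']
    mask = batch['attention_mask']
    targets = batch['targets']

    outputs, attns = model(input_ids=ids, attention_mask=mask, labels=targets)
    loss = loss_fn(outputs.squeeze(-1), targets)

    optimizer.zero_grad()
    loss.backward()   # <-- this is where the error is raised
    optimizer.step()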

Is there anything obviously wrong with the model definition?
(I’m not sure about the torch.nn.ReLU and Sigmoid layers).

Can anyone advise?

Update: I’ve found the problem. I needed to set dtype=torch.float on the target tensor.
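
For anyone hitting the same error, this is the kind of change that fixed it for me (illustrative only; the tensor names are placeholders):

# build the target tensor with a float dtype, e.g. in the Dataset's __getitem__
targets = torch.tensor(raw_target, dtype=torch.float)

# or cast an existing integer/long target tensor before computing the loss
targets = targets.to(torch.float)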

(Any advice on the model definition would still be welcome!)