I am trying to find the gradient of the output of one BERT layer with respect to the output of an earlier layer, token-wise. But I keep getting this error: "RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior." Below is the code snippet:
for count, data in enumerate(iter(data_loader)):
    input_ids = torch.squeeze(data['input_ids'], dim=0)
    attention_mask = torch.squeeze(data['attention_mask'], dim=0)
    last_hidden_state, pooled_output, hidden_states = bert_model(input_ids=input_ids, attention_mask=attention_mask)
    bert_layer_i_output = hidden_states[i][0]
    print(bert_layer_i_output.shape)
    bert_layer_j_output = hidden_states[j][0]
    # print(torch.autograd.grad(bert_layer_j_output, bert_layer_i_output, retain_graph=True, create_graph=True))
    for k in range(bert_layer_i_output.shape[0]):
        gradient = torch.autograd.grad(bert_layer_j_output[k], bert_layer_i_output[k], grad_outputs=torch.ones_like(bert_layer_j_output[k]))
        print(gradient.shape)
        print(torch.norm(gradient))
        break
    break
Below is the stack trace of the error:
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
    202     return Variable._execution_engine.run_backward(
    203         outputs, grad_outputs_, retain_graph, create_graph,
--> 204         inputs, allow_unused)
    205
    206

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Am I doing something wrong? Ideally, both tensors should be part of the same computational graph, right?
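For reference, here is a minimal sketch that seems to reproduce the same behavior outside of BERT (a toy torch.nn.Linear standing in for the layers, so the names and shapes here are just for illustration, not my actual model). Indexing the input tensor appears to create a new tensor that the output was never computed from, while differentiating with respect to the full tensor works:

import torch

# toy stand-in for "layer i" output and "layer j" output
x = torch.randn(5, 768, requires_grad=True)   # (tokens, hidden)
layer = torch.nn.Linear(768, 768)
y = layer(x)

# This line reproduces the error: x[0] is a new tensor created by the
# indexing op, and y was not computed from it, so autograd cannot find
# it in y's graph.
# torch.autograd.grad(y[0], x[0], grad_outputs=torch.ones_like(y[0]))

# Differentiating with respect to the full tensor works; grad() returns a tuple.
grad, = torch.autograd.grad(y[0], x, grad_outputs=torch.ones_like(y[0]))
print(grad.shape)           # torch.Size([5, 768])
print(torch.norm(grad[0]))  # only row 0 is nonzero here, since y[0] depends only on x[0]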