Hi, this might not be possible, but I am looking for a way to get the gradients of the logits with respect to the attentions (on a per-head, per-layer basis).
Essentially, I am looking for dy/dA, where "y" is a logit output and "A" is the self-attention matrix of a particular head in a particular layer. The model has already been trained/fine-tuned.
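For concreteness, here is a rough, untested sketch of what I have in mind, assuming a Hugging Face Transformers classifier; the model name, input text, and layer/head indices are just placeholders:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint -- in practice this would be my fine-tuned model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# "eager" attention (if the installed version supports the kwarg) so the
# per-head attention probabilities are materialized and kept in the
# autograd graph, rather than fused away by sdpa/flash attention
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, attn_implementation="eager"
)
model.eval()

inputs = tokenizer("An example sentence.", return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# Pick one logit y (here: class 0 of the first example in the batch)
y = outputs.logits[0, 0]

# outputs.attentions is a tuple of per-layer tensors of shape
# (batch, num_heads, seq_len, seq_len); grads has the same structure
grads = torch.autograd.grad(y, outputs.attentions, retain_graph=True)

# e.g. dy/dA for head 3 of layer 5:
dy_dA = grads[5][0, 3]  # (seq_len, seq_len)
```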
Is this possible to do?
Thanks!