Say I’ve trained a BERT model for classification. I’d like to calculate the proportional impact each input token has on the predicted output.
For example - and this is very general - if I have a model that labels input text as {'about dogs': 0, 'about cats': 1}, the following input sentence:
s = 'this is a sentence about a cat'
should output very close to:
1
HOWEVER, what I’d like is to calculate each input token’s impact on that final prediction, e.g. (assuming we’re tokenizing on the level of words - which is not how it would be done in practice, I know):
{this: 0.01, is: 0.005, a: 0.02, sentence: 0.0003, about: [some other low prob], a: [another low prob], cat: 0.999999}
Intuitively I’d think this means running a forward pass with the input sentence, then looking at the gradients from a backward pass? But I’m not quite sure how you’d actually extract per-token values that way. Thoughts?
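To make the question concrete, here is roughly what I imagine the gradient-based approach looking like - a minimal sketch of "gradient × input" saliency. Everything here is a stand-in (a toy embedding table and linear head instead of an actual trained BERT, mean pooling instead of the [CLS] token), just to illustrate the mechanics I'm asking about:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for BERT's embedding layer and fine-tuned classifier head
vocab = {'this': 0, 'is': 1, 'a': 2, 'sentence': 3, 'about': 4, 'cat': 5}
embed = nn.Embedding(len(vocab), 8)
classifier = nn.Linear(8, 2)

words = 'this is a sentence about a cat'.split()
token_ids = torch.tensor([[vocab[w] for w in words]])

embeddings = embed(token_ids)        # (1, seq_len, hidden)
embeddings.retain_grad()             # keep grads for this non-leaf tensor

logits = classifier(embeddings.mean(dim=1))   # crude pooling + classification
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()           # backprop the winning class's score

# gradient * input, reduced to one scalar per token, normalized to proportions
scores = (embeddings.grad * embeddings).sum(dim=-1).abs().squeeze(0)
proportions = scores / scores.sum()
for word, p in zip(words, proportions.tolist()):
    print(f'{word}: {p:.4f}')
```

Is this the right general idea, i.e. taking the gradient of the predicted class's logit with respect to the token embeddings and scoring each token by it? Or is there a more standard way to do this for BERT specifically?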