In distilbart, can I identify how much weight each word in the sequence carries for a candidate label/class? I want to narrow down why the model assigns a particular score to a given class. For example, if “This is awesome anyone thinking to buy should blindly go for it” is assigned a positive label score of 0.99, is it possible to identify the words in the sequence that contribute the most to the positive label (in this sequence, the words/phrases “awesome” and “blindly go for it”) and the relative weight (cosine similarity/distance) of those words to the identified class (i.e. positive)?

Can this be done by accessing or manipulating the end layers of the model or by any other method?
Thank you for your help in advance!

@joeddav might have some ideas here.

You can look at the attention weights. This is not exactly “explanation”, but also not entirely not explanation either :slight_smile: That is, it can give you some notion of how “important” a word is, but is very limited in its explanatory power. Models like these are difficult to interpret.

Pass output_attentions=True when you call the model to get the attention patterns back. You’ll probably want the attention between each word and the EOS token, since the final hidden state at that token is the sequence representation that BartForSequenceClassification uses. But which layers and heads are best to use, or how to aggregate their patterns, is beyond my knowledge.
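Here is a minimal sketch of that idea. It assumes the cross-attention of the last decoder layer, averaged over heads, is a reasonable thing to look at; that choice is arbitrary, given the open question about layers and heads. To stay runnable offline it builds a tiny, randomly initialized BartForSequenceClassification and feeds it hand-written token ids. The helper name eos_attention_scores and the toy config are illustrative, not part of transformers. With a real checkpoint you would load the model and tokenizer with from_pretrained instead.

```python
import torch
from transformers import BartConfig, BartForSequenceClassification


def eos_attention_scores(model, input_ids, attention_mask=None):
    """Return (logits, scores) for the first sequence in the batch.

    scores[i] is the attention that the decoder position used for
    classification (the last EOS position) pays to encoder token i,
    taken from the last decoder layer and averaged over heads.
    """
    model.eval()
    with torch.no_grad():
        out = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_attentions=True,
        )
    # One tensor per decoder layer, each (batch, heads, tgt_len, src_len).
    last_layer = out.cross_attentions[-1]
    # Position of the last EOS token in the first sequence.
    eos_pos = (input_ids[0] == model.config.eos_token_id).nonzero()[-1].item()
    scores = last_layer[0, :, eos_pos, :].mean(dim=0)
    return out.logits[0], scores


# Toy stand-in for a real checkpoint (assumption: random weights, tiny sizes).
torch.manual_seed(0)
config = BartConfig(
    vocab_size=100, d_model=16, max_position_embeddings=64,
    encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
    num_labels=3,
    attn_implementation="eager",  # eager attention returns the weights
)
model = BartForSequenceClassification(config)

# <s> tok tok tok tok </s>, using BART's default special ids (bos=0, eos=2).
input_ids = torch.tensor([[0, 10, 11, 12, 13, 2]])
logits, scores = eos_attention_scores(model, input_ids, torch.ones_like(input_ids))
for tok_id, s in zip(input_ids[0].tolist(), scores.tolist()):
    print(tok_id, round(s, 3))
```

The scores are per subword token, so with a real tokenizer you would still need to merge them back into words, and, as noted above, treat them as a rough hint rather than an explanation.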

As an aside for any interested readers, we have an awesome tool for visualizing transformers, which is really helpful for building intuition about attention patterns in transformers.


Hi, thank you for your response. I’ll try to take a deep dive into those; that seems like a good place to start. :smiley:

Hi, there is a library released by PyTorch for model interpretability. It uses the internal embeddings and integrated gradients to give the saliency features and attributions of the input words. They have shown its application to a BERT model trained on SQuAD for question answering. Is it possible to do the same on these BART models? I am not really well versed in how the pipeline works; is it possible to run it on the distilbart zero-shot models? I presume I might have to replicate the zero-shot pipeline and pass the model through their interpretation functions.

I don’t have a complete answer for you, but don’t worry about the pipeline itself, as it’s just a wrapper for the model. I assume this code is designed for BertModel, so the hardest thing you’ll probably need to do is adapt it for BartModel or BartForSequenceClassification.
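To illustrate the "just a wrapper" point: as far as I understand it, the zero-shot pipeline pairs the sequence with a hypothesis such as "This example is positive.", runs the NLI model on the pair, and turns the entailment logit into a score. Below is a rough sketch under the assumption that the label order is contradiction=0, neutral=1, entailment=2 (check model.config.label2id on your checkpoint). entailment_score and toy_encode_pair are hypothetical helpers, and the toy model and toy tokenizer exist only so the snippet runs without downloading anything.

```python
import torch
from transformers import BartConfig, BartForSequenceClassification


def entailment_score(model, encode_pair, sequence, label,
                     template="This example is {}.",
                     entail_id=2, contra_id=0):
    """Score one candidate label: softmax over (contradiction, entailment)."""
    enc = encode_pair(sequence, template.format(label))
    model.eval()
    with torch.no_grad():
        logits = model(**enc).logits[0]
    # Drop "neutral" and normalise the remaining two logits.
    pair = logits[[contra_id, entail_id]]
    return torch.softmax(pair, dim=0)[1].item()


def toy_encode_pair(premise, hypothesis):
    """Hypothetical stand-in for a real tokenizer call on a text pair."""
    def ids(text):
        # Map each whitespace token to an id in 3..92 (avoids special ids).
        return [3 + (sum(map(ord, w)) % 90) for w in text.split()]
    # BART pair layout: <s> premise </s></s> hypothesis </s>
    seq = [0] + ids(premise) + [2, 2] + ids(hypothesis) + [2]
    input_ids = torch.tensor([seq])
    return {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}


# Toy stand-in for a real NLI checkpoint (assumption: random weights).
torch.manual_seed(0)
config = BartConfig(
    vocab_size=100, d_model=16, max_position_embeddings=64,
    encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
    num_labels=3,
)
model = BartForSequenceClassification(config)

text = "This is awesome anyone thinking to buy should blindly go for it"
score = entailment_score(model, toy_encode_pair, text, "positive")
print(round(score, 3))
```

With a real checkpoint you would pass something like `lambda p, h: tokenizer(p, h, return_tensors="pt")` as encode_pair. For a gradient-based attribution method you would also need a variant without torch.no_grad() and .item(), since those cut the computation off from the gradient.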