Query, key and value projections are used to compute attention. In BERT we can extract the attention weights for each layer and each head by passing `output_attentions=True`. How can I extract the attention *gradients* for each layer and each head?
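One common way to do this with the Hugging Face `transformers` BERT models in PyTorch is to request the attention tensors with `output_attentions=True`, call `retain_grad()` on them (they are non-leaf tensors, so their gradients are discarded by default), run `backward()` on a scalar, and then read each tensor's `.grad`. Below is a minimal sketch; the checkpoint `bert-base-uncased`, the `BertForSequenceClassification` head, and the choice of backward target (a single logit) are illustrative assumptions, not the only option.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()  # disables dropout; gradients still flow

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
for att in outputs.attentions:
    att.retain_grad()  # keep gradients for these non-leaf tensors

# Any scalar works as the backward target; here, the logit of class 0
# (with a labeled batch you could use the model's loss instead).
target = outputs.logits[0, 0]
target.backward()

# Gradient of the chosen scalar w.r.t. each layer's attention weights,
# indexed as attention_grads[layer][batch, head, query_pos, key_pos].
attention_grads = [att.grad for att in outputs.attentions]
print(attention_grads[0].shape)
```

Each entry of `attention_grads` has the same shape as the corresponding attention tensor, so you can slice out a particular layer and head, e.g. `attention_grads[layer][0, head]`.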