Analysis of attention maps

There is a lot of research on KV-cache dropping, based on the idea that some tokens carry little information. But when I analyze my attention scores, they are quite sparse and their values are very low. I cannot extract any useful insight, such as which kinds of tokens receive more attention; I just see higher scores on special tokens, punctuation, etc. Given that a model has n layers and m attention heads, how can I gain some valuable insights?

My task is to extract important information from the input I provide.
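One way to get a handle on n layers × m heads is to reduce each head to a single statistic and look for outlier heads. Below is a minimal NumPy sketch, assuming you have already exported the attention probabilities as a tensor of shape `(n_layers, n_heads, seq_len, seq_len)` (e.g., via `output_attentions=True` in Hugging Face transformers); the function name and the toy data are illustrative, not from any specific library.

```python
import numpy as np

def head_entropy(attn, eps=1e-9):
    """attn: (n_layers, n_heads, seq_len, seq_len) attention probabilities.
    Returns the mean entropy of each head's attention distribution.
    Low mean entropy = a focused ("retrieval-like") head worth inspecting;
    high mean entropy = diffuse attention, often less informative."""
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (layers, heads, seq)
    return ent.mean(axis=-1)                         # (layers, heads)

# Toy example: 2 layers, 2 heads, sequence length 4,
# with rows softmax-normalized like real attention weights.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 4, 4))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

summary = head_entropy(attn)  # shape (2, 2): one score per (layer, head)
```

Sorting heads by this score (or by how much mass they put on non-special tokens) is a cheap first pass before staring at n×m individual heatmaps.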

I got something similar to the left picture: more attention on special tokens, punctuation, or the local neighborhood. These attention scores are very low, yet the answers are still good. Is it possible that only a few attention heads play an important role? I really want to discover which tokens need greater attention for my task, so that I can save some memory.


What does accumulative attention mean?

The attention scores accumulated on special tokens / local positions, i.e., the attention a token receives summed over all query positions (and optionally over heads and layers).
