Hi~
I want to use attention maps to visualize the relationships between tokens, like this map:
So I set the keyword arguments output_attentions=True and return_dict_in_generate=True, hoping to get the corresponding attention maps.
The code is as follows:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")
print("done")
path = "/mnt/workspace/workgroup/lz/111.jpg"
image = Image.open(path)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to("cuda:0")
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=256, output_attentions=True, return_dict_in_generate=True)
for attn in output.attentions[-1]:
    print(attn.size())  # torch.Size([1, 32, 1, 2353])
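For reference, this is how I currently read the nested structure that generate() returns (a sketch based on my understanding of the docs; the shapes in the comments are my guesses, please correct me if I got the nesting wrong):

# my understanding: output.attentions is a tuple with one entry per generated token,
# and each entry is a tuple with one attention tensor per decoder layer
print(len(output.attentions))            # number of generation steps (= new tokens)
print(len(output.attentions[0]))         # number of decoder layers (32 for Mistral-7B)
print(output.attentions[0][0].size())    # first (prefill) step: (1, 32, prompt_len, prompt_len)?
print(output.attentions[-1][0].size())   # later steps: (1, 32, 1, key_len) because of the KV cache?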
But the per-step tensors do not follow the shape "(batch_size, num_heads, sequence_length, sequence_length)" documented in transformers/src/transformers/models/llava_next/modeling_llava_next.py (line 177).
Ideally, the attention maps would be square, something like [1, 32, 768, 768], instead of [1, 32, 1, 2353].
How can I get attention maps in the usual 2D (sequence_length x sequence_length) shape, and why do I get outputs of size [1, 32, 1, 2353]?
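As a possible workaround, I am considering re-running a single forward pass over the full generated sequence with output_attentions=True, roughly like the sketch below (assuming the pixel values and image sizes still have to be passed along with the full sequence). Is this the intended way to get square attention maps, or is there a better one?

# sketch of a possible workaround: one forward pass over prompt + generated tokens,
# so every query position attends over every key position in a single call
with torch.no_grad():
    full_out = model(
        input_ids=output.sequences,                        # prompt + generated tokens
        attention_mask=torch.ones_like(output.sequences),  # assuming no padding (batch size 1)
        pixel_values=inputs["pixel_values"],
        image_sizes=inputs["image_sizes"],
        output_attentions=True,
    )
for attn in full_out.attentions:
    print(attn.size())  # hoping for (1, 32, seq_len, seq_len) per layer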