Apologies if this is a stupid question. I want to inspect the attention weights in Llama 2. A minimal working example (MWE) is below:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model_inputs = tokenizer.encode("Cats chase dogs", return_tensors="pt").to("cuda:0")
output = model.generate(model_inputs, output_attentions=True, max_new_tokens=5, return_dict_in_generate=True)
print(output.attentions[0][0][0][0]) # first generation step, first layer, first batch item, first head
The output is:
tensor([[0.3464, 0.1967, 0.0438, 0.1245, 0.0281, 0.2606],
[0.3714, 0.3745, 0.0209, 0.1465, 0.0157, 0.0710],
[0.1448, 0.4541, 0.0274, 0.2500, 0.0149, 0.1087],
[0.2242, 0.3160, 0.0371, 0.2656, 0.0157, 0.1414],
[0.1242, 0.2456, 0.0464, 0.3104, 0.0118, 0.2615],
[0.1509, 0.1745, 0.0558, 0.2522, 0.0131, 0.3535]], device='cuda:0')
I would have expected this matrix to be lower triangular: Llama 2 uses causal attention, so each token should only attend to itself and earlier positions, and every entry above the diagonal should be zero. Why isn't it?
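For reference, here is what I mean by "triangular". This is a minimal sketch of the causal pattern I expected, built from random weights rather than the model's actual attention (the 6x6 size just matches my 6 prompt tokens):

```python
import torch

torch.manual_seed(0)
attn = torch.rand(6, 6)  # stand-in for raw attention scores over 6 tokens

# Causal masking: zero out everything above the diagonal,
# then renormalize each row so the weights sum to 1.
causal = torch.tril(attn)
causal = causal / causal.sum(dim=-1, keepdim=True)

# Every entry above the diagonal is zero, i.e. the matrix is lower triangular
assert torch.equal(causal, torch.tril(causal))
```

The matrix printed above clearly has nonzero entries above the diagonal, which is what confuses me.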