Apologies if this is a stupid question. I want to inspect the attention weights in Llama 2. A minimal working example (MWE) is below:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model_inputs = tokenizer.encode("Cats chase dogs", return_tensors="pt").to("cuda:0")
output = model.generate(model_inputs, output_attentions=True, max_new_tokens=5, return_dict_in_generate=True)
print(output.attentions[0][0][0][0]) # first generation step, first layer, first batch item, first head
The output is:
tensor([[0.3464, 0.1967, 0.0438, 0.1245, 0.0281, 0.2606],
[0.3714, 0.3745, 0.0209, 0.1465, 0.0157, 0.0710],
[0.1448, 0.4541, 0.0274, 0.2500, 0.0149, 0.1087],
[0.2242, 0.3160, 0.0371, 0.2656, 0.0157, 0.1414],
[0.1242, 0.2456, 0.0464, 0.3104, 0.0118, 0.2615],
[0.1509, 0.1745, 0.0558, 0.2522, 0.0131, 0.3535]], device='cuda:0')
I would have expected this matrix to be lower triangular: Llama 2 uses causal attention, so each token should only attend to itself and earlier positions, and every entry above the diagonal should be zero. Why isn't it?
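For reference, here is what I mean by "triangular". This is a minimal sketch of the causal pattern I expected, built from random weights rather than the model's actual attention (the 6x6 size just matches my 6 prompt tokens):

```python
import torch

torch.manual_seed(0)
attn = torch.rand(6, 6)  # stand-in for raw attention scores over 6 tokens

# Causal masking: zero out everything above the diagonal,
# then renormalize each row so the weights sum to 1.
causal = torch.tril(attn)
causal = causal / causal.sum(dim=-1, keepdim=True)

# Every entry above the diagonal is zero, i.e. the matrix is lower triangular
assert torch.equal(causal, torch.tril(causal))
```

The matrix printed above clearly has nonzero entries above the diagonal, which is what confuses me.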