Why are Llama2 attention weights not lower triangular?

Apologies if this is a stupid question. I want to look at the attention weights in Llama2. A minimal working example (MWE) is below:

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model_inputs = tokenizer.encode("Cats chase dogs", return_tensors="pt").to("cuda:0")

output = model.generate(model_inputs, output_attentions=True, max_new_tokens=5, return_dict_in_generate=True)

print(output.attentions[0][0][0][0])  # attention weights of the first head in the first layer (first generation step, first batch item)
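If I understand the generate output correctly, output.attentions is nested as (generation step)(layer), and each entry has shape (batch, num_heads, query_len, key_len), so the four indices pick step 0, layer 0, batch item 0, head 0:

print(output.attentions[0][0].shape)  # e.g. torch.Size([1, 32, 6, 6]) for this 6-token prompt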

The output is:
tensor([[0.3464, 0.1967, 0.0438, 0.1245, 0.0281, 0.2606],
[0.3714, 0.3745, 0.0209, 0.1465, 0.0157, 0.0710],
[0.1448, 0.4541, 0.0274, 0.2500, 0.0149, 0.1087],
[0.2242, 0.3160, 0.0371, 0.2656, 0.0157, 0.1414],
[0.1242, 0.2456, 0.0464, 0.3104, 0.0118, 0.2615],
[0.1509, 0.1745, 0.0558, 0.2522, 0.0131, 0.3535]], device='cuda:0')

I would have expected this to be lower triangular, since with a causal mask no token should attend to later positions.
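As a quick sanity check (reusing the tensor from the call above), the entries above the diagonal are clearly non-zero:

import torch
attn = output.attentions[0][0][0][0]           # same tensor as printed above
print(torch.allclose(attn, torch.tril(attn)))  # False: there is mass above the diagonal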

Update: when I print attention weights from a Llama2 implementation not hosted by HF, they are lower triangular.

Hey! This is because, by default, we use sdpa, and the causal mask is not returned. I answered this here: GemmaForCausalLM Causal Masking Not Working · Issue #30813 · huggingface/transformers · GitHub
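If you want the masked (lower-triangular) weights, one option is to force the eager attention implementation when loading the model. A minimal sketch, assuming your transformers version supports the attn_implementation argument (added around v4.36):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
# Force the eager (pure PyTorch) attention path instead of sdpa, so that
# output_attentions returns weights with the causal mask applied.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model_inputs = tokenizer.encode("Cats chase dogs", return_tensors="pt").to("cuda:0")
output = model.generate(
    model_inputs,
    output_attentions=True,
    max_new_tokens=5,
    return_dict_in_generate=True,
)
print(output.attentions[0][0][0][0])  # should now be lower triangular

With eager attention the causal mask sets the masked logits to -inf before the softmax, so the positions above the diagonal come out as zeros in the returned weights.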