Issue with LlamaSdpaAttention Not Being Utilized

Hello everyone,

I’m working with the Llama model from Hugging Face Transformers (v4.48.3) and noticed that it’s using LlamaAttention instead of LlamaSdpaAttention by default. This seems unexpected since my understanding is that the model should automatically use the SDPA kernel (torch.nn.functional.scaled_dot_product_attention) when possible.

Here’s my minimal reproduction:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)  # Shows LlamaAttention, not LlamaSdpaAttention
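
As a sanity check, the selected backend can also be read off the config rather than the module repr (this relies on the private attribute _attn_implementation, which I believe recent Transformers releases populate, so treat it as an assumption rather than a public API):

print(model.config._attn_implementation)  # I would expect "sdpa" here if SDPA were active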

Even though I’m not requesting attention outputs or triggering any known conditions that would force a fallback to eager mode, the model still uses LlamaAttention. My environment:

  • Python 3.10.14
  • PyTorch 2.5.1
  • Transformers 4.48.3
  • CUDA 12.4.1
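
For what it's worth, the SDPA entry point itself is present in this PyTorch build (a minimal check, nothing else assumed):

import torch
# scaled_dot_product_attention has shipped with PyTorch since 2.0, so this should print True on 2.5.1
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))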

What determines whether LlamaSdpaAttention is used by default? Is there something specific about this version of Transformers or my setup that’s preventing automatic SDPA usage?

Also, when I try to force SDPA manually via the config, the printed model still shows the plain LlamaAttention:

from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_name)
config._attn_implementation = "sdpa"
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
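
For reference, the same request can also be made at load time; my understanding of the documented API is that the attn_implementation keyword should be equivalent to setting the config attribute, though I'm treating that as an assumption here:

# Load-time request for SDPA; should match setting config._attn_implementation
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="sdpa")
print(model.config._attn_implementation)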

Thanks for any insights!


I’ve confirmed that the behavior can be reproduced. Your usage and torch version look correct…
From the code, I don’t see any other suspicious branches.