Issue with LlamaSdpaAttention Not Being Utilized

Hello everyone,

I’m working with the Llama model from Hugging Face Transformers (v4.48.3) and noticed that it’s using LlamaAttention instead of LlamaSdpaAttention by default. This seems unexpected since my understanding is that the model should automatically use the SDPA kernel (torch.nn.functional.scaled_dot_product_attention) when possible.

Here’s my minimal reproduction:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)  # Shows LlamaAttention, not LlamaSdpaAttention
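
As a sanity check, the selected backend can also be read off the config rather than the module repr (this relies on the private attribute _attn_implementation, which I believe recent Transformers releases populate, so treat it as an assumption rather than a public API):

print(model.config._attn_implementation)  # I would expect "sdpa" here if SDPA were active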

Even though I’m not requesting attention outputs or triggering any known conditions that would force a fallback to eager mode, the model still uses LlamaAttention. My environment:

  • Python 3.10.14
  • PyTorch 2.5.1
  • Transformers 4.48.3
  • CUDA 12.4.1
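
For what it's worth, the SDPA entry point itself is present in this PyTorch build (a minimal check, nothing else assumed):

import torch
# scaled_dot_product_attention has shipped with PyTorch since 2.0, so this should print True on 2.5.1
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))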

What determines whether LlamaSdpaAttention is used by default? Is there something specific about this version of Transformers or my setup that’s preventing automatic SDPA usage?

Also, when I try to force SDPA manually via the config, the printed model still shows the plain LlamaAttention:

from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_name)
config._attn_implementation = "sdpa"
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
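
For reference, the same request can also be made at load time; my understanding of the documented API is that the attn_implementation keyword should be equivalent to setting the config attribute, though I'm treating that as an assumption here:

# Load-time request for SDPA; should match setting config._attn_implementation
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="sdpa")
print(model.config._attn_implementation)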

Thanks for any insights!


I’ve confirmed that the behavior can be reproduced. Your usage and torch version look correct…
From the code, I don’t see any other suspicious branches.