SDPA attention in e.g. Llama does not use the fused kernels

Hi, I’m worried that the attention implementation relying on PyTorch’s scaled_dot_product_attention does not use its full potential.

Indeed, the function call at line 673 of modeling_llama.py does not pass the ‘is_causal’ argument, which is what allows the fused implementations to be selected (Accelerated PyTorch 2 Transformers | PyTorch):

" At present, the only attention mask supported by fused kernel implementation is the causal mask commonly used for training. To specify the causal mask in custom kernels, it must be specified with the is_causal boolean and attn_mask must be None".

Hope this can help,
Anthony.