The LlamaModel docs, for example, suggest that the attention_mask passed to forward should be 2-dimensional.
However, looking at the source code, it appears to be possible to provide a 4D mask, and this will override the standard (e.g. causal) mask.
Is this correct? Should this be documented? Is there anything to watch out for when doing this? (I’m interested in providing a custom attention masking pattern to the Llama architecture).
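For concreteness, here is a minimal sketch of the kind of call I have in mind. It assumes (based on my reading of the source, not the docs) that a float mask of shape (batch, 1, query_len, key_len), with 0 for positions to attend and a large negative value for masked positions, is used in place of the default causal mask; the checkpoint name is only for illustration.

import torch
from transformers import AutoTokenizer, LlamaModel

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = LlamaModel.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hello world", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# e.g. fully bidirectional attention instead of the default causal pattern
custom_mask = torch.zeros(1, 1, seq_len, seq_len)

outputs = model(input_ids=inputs["input_ids"], attention_mask=custom_mask)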
Hi @spranav1205,
so I am using LlamaForCausalLM; here is my code snippet:
import torch  # device, model, tokenizer, msg, and no_masking_length are defined elsewhere in my setup

def generate_additive_attention_mask(no_masking_length, total_length):
    # no_masking_length is the number of leading tokens that should attend to each other
    # bidirectionally (no causal masking); everything after them stays causal
    mask = torch.tril(torch.ones(total_length, total_length)).to(device)
    mask[:no_masking_length, :no_masking_length] = 1
    mask = mask.unsqueeze(0).unsqueeze(0)  # add batch and num_attention_heads dimensions -> (1, 1, L, L)
    mask = (1 - mask) * -1e9  # Llama-2-7b-chat uses an additive mask: attend -> 0, masked -> large negative number
    return mask
inputs = tokenizer(msg, return_tensors="pt").to(device)  # return PyTorch tensors so .shape works below
seq_len = inputs["input_ids"].shape[1]
inputs["attention_mask"] = generate_additive_attention_mask(no_masking_length, seq_len)
outputs = model(**inputs)
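A quick way to sanity-check the mask before the forward pass (tiny hypothetical numbers, assuming device is defined as above):

m = generate_additive_attention_mask(no_masking_length=2, total_length=4)
print(m.shape)               # torch.Size([1, 1, 4, 4])
print((m == 0).int()[0, 0])  # 1 = attend, 0 = masked
# Expected: the top-left 2x2 block is all ones (the prefix attends bidirectionally),
# and the remaining rows follow the usual causal (lower-triangular) pattern.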