Attention mask shape (custom attention masking)

The LlamaModel docs suggest, for example, that the attention_mask passed to forward should be 2-dimensional.

However, looking at the source code, it appears possible to provide a 4D mask instead, and this will override the standard (e.g. causal) mask.

Is this correct? Should it be documented? Is there anything to watch out for when doing this? (I'm interested in providing a custom attention-masking pattern to the Llama architecture.)


Hi,
yes, I passed a custom 4D mask and it worked :) (confirmed by inspecting the attention scores).
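For reference, here is roughly what that looks like. This is a minimal sketch, not my exact code: the checkpoint name and the custom mask pattern are illustrative, and it assumes a recent transformers version where a 4D float mask of shape (batch, 1, query_len, key_len) is passed through to the attention layers as an additive mask (0.0 = attend, large negative = masked). Behavior may differ between versions.

```python
import torch
from transformers import AutoTokenizer, LlamaModel

# Illustrative checkpoint; any Llama checkpoint should behave the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaModel.from_pretrained(model_name, attn_implementation="eager")

inputs = tokenizer("a custom masking example", return_tensors="pt")
batch_size, seq_len = inputs["input_ids"].shape

# Start from a standard causal mask, then customize it.
min_value = torch.finfo(model.dtype).min
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
mask_4d = torch.zeros(batch_size, 1, seq_len, seq_len, dtype=model.dtype)
mask_4d.masked_fill_(causal, min_value)
# Example custom pattern: block the last token from attending to the first token.
mask_4d[:, :, -1, 0] = min_value

# output_attentions=True lets you inspect the attention scores and confirm
# the mask was actually applied.
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=mask_4d,
    output_attentions=True,
)
# Attention weight of the last query on the first key should be ~0.
print(outputs.attentions[0][0, 0, -1, 0])
```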

Related to your question: did you ever find out whether the custom mask should be additive or binary?
