The docs for e.g. LlamaModel suggest that the attention_mask passed to forward should be 2-dimensional.
However, looking at the source code, it appears that a 4D mask can be provided, and that it overrides the standard (e.g. causal) mask.
Is this correct? Should this be documented? Is there anything to watch out for when doing this? (I’m interested in providing a custom attention masking pattern to the Llama architecture).
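For concreteness, here is a minimal sketch of the kind of thing I'd like to do. I'm assuming (based on my reading of the source, so please correct me if this is wrong) that the 4D mask should be additive, i.e. 0.0 where attention is allowed and a large negative value where it is blocked, with shape (batch, 1, query_len, key_len), and that passing it bypasses the internally built causal mask. The tiny randomly initialised config is just to keep the example self-contained, and I realise the exact behaviour may depend on the transformers version:

```python
import torch
from transformers import LlamaConfig, LlamaModel

# Tiny randomly-initialised Llama, just so the example runs without downloading weights.
config = LlamaConfig(
    vocab_size=1000,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
)
model = LlamaModel(config).eval()

batch_size, seq_len = 1, 8
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))

# Custom pattern (arbitrary example): causal, except that queries at
# positions >= 4 may not attend to the first two tokens.
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
allowed[4:, :2] = False

# Assumed convention: additive float mask, 0.0 where attention is allowed,
# a large negative value where it is blocked, shape (batch, 1, q_len, kv_len).
mask_4d = torch.full((seq_len, seq_len), torch.finfo(torch.float32).min)
mask_4d[allowed] = 0.0
mask_4d = mask_4d[None, None, :, :].expand(batch_size, 1, seq_len, seq_len)

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=mask_4d)
print(out.last_hidden_state.shape)  # expected: (1, 8, 64)
```

Is this roughly the intended way to do it, or is there a supported API for custom masking patterns that I'm missing?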