The LlamaModel docs, for example, suggest that the attention_mask passed to forward should be 2-dimensional.
However, looking at the source code, it appears to be possible to provide a 4D mask, and this will override the standard (e.g. causal) mask.
Is this correct? Should this be documented? Is there anything to watch out for when doing this? (I’m interested in providing a custom attention masking pattern to the Llama architecture).
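For concreteness, here is a minimal sketch of the kind of call I have in mind. It assumes (based on my reading of the source, not the docs) that a float mask of shape (batch, 1, query_len, key_len), with 0 for positions to attend and a large negative value for masked positions, is used in place of the default causal mask; the checkpoint name is only for illustration.

import torch
from transformers import AutoTokenizer, LlamaModel

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = LlamaModel.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hello world", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# e.g. fully bidirectional attention instead of the default causal pattern
custom_mask = torch.zeros(1, 1, seq_len, seq_len)

outputs = model(input_ids=inputs["input_ids"], attention_mask=custom_mask)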
Hi @spranav1205,
so I am using LlamaForCausalLM; here is my code snippet:
import torch  # device, model, tokenizer, msg, and no_masking_length are defined elsewhere in my setup

def generate_additive_attention_mask(no_masking_length, total_length):
    # no_masking_length is the number of leading tokens that should attend to each other
    # bidirectionally (no causal masking); everything after them stays causal
    mask = torch.tril(torch.ones(total_length, total_length)).to(device)
    mask[:no_masking_length, :no_masking_length] = 1
    mask = mask.unsqueeze(0).unsqueeze(0)  # add batch and num_attention_heads dimensions -> (1, 1, L, L)
    mask = (1 - mask) * -1e9  # Llama-2-7b-chat uses an additive mask: attend -> 0, masked -> large negative number
    return mask
inputs = tokenizer(msg, return_tensors="pt").to(device)  # return PyTorch tensors so .shape works below
seq_len = inputs["input_ids"].shape[1]
inputs["attention_mask"] = generate_additive_attention_mask(no_masking_length, seq_len)
outputs = model(**inputs)
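A quick way to sanity-check the mask before the forward pass (tiny hypothetical numbers, assuming device is defined as above):

m = generate_additive_attention_mask(no_masking_length=2, total_length=4)
print(m.shape)               # torch.Size([1, 1, 4, 4])
print((m == 0).int()[0, 0])  # 1 = attend, 0 = masked
# Expected: the top-left 2x2 block is all ones (the prefix attends bidirectionally),
# and the remaining rows follow the usual causal (lower-triangular) pattern.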