Does Llama-2 use additive attention masking?

Hi,
I need to build a custom causal attention mask for meta-llama/Llama-2-7b-chat-hf, loaded via LlamaForCausalLM.

Can someone confirm that it uses additive masking (0 or a large negative number close to -inf) instead of binary masking (0 or 1) for its attention mechanism?

When I investigated the attention scores, I found that binary masking doesn't work, whereas additive masking actually led to attention scores of 0 in the right dimensions. On the other hand, the documentation (Llama2) says it uses binary masking, and the standard attention mask from the tokenizer only contains 1-values, which confuses me.
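
For context, this is roughly how I build the two variants I compared (just a sketch; the function names are my own, and I'm assuming the (batch, 1, q_len, kv_len) shape that the attention layers seem to work with internally):

```python
import torch

def make_additive_causal_mask(seq_len: int, dtype=torch.float32) -> torch.Tensor:
    # 0.0 where attention is allowed, a large negative value (~-inf) where it is
    # blocked, so softmax(scores + mask) pushes blocked positions to ~0 probability.
    min_value = torch.finfo(dtype).min
    mask = torch.full((seq_len, seq_len), min_value, dtype=dtype)
    mask = torch.triu(mask, diagonal=1)      # strictly upper triangle stays ~-inf
    return mask[None, None, :, :]            # shape (1, 1, seq_len, seq_len)

def make_binary_causal_mask(seq_len: int) -> torch.Tensor:
    # Binary variant for comparison: 1 = attend, 0 = blocked.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))[None, None, :, :]
```

Only the additive version zeroes out the attention scores in the masked positions when I inspect them; the binary one doesn't.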

Thanks in advance
