Hi,
I need to build a custom causal attention mask for meta-llama/Llama-2-7b-chat-hf, loaded via LlamaForCausalLM.
Can someone confirm that it uses additive masking (0 or a large negative number close to -inf) rather than binary masking (0 or 1) for its attention mechanism?
When I inspected the attention scores, binary masking didn’t work, whereas an additive mask actually led to attention scores of 0 in the right positions. On the other hand, the documentation says it uses binary masking (Llama2), and the standard attention mask returned by the tokenizer only contains 1s, which confuses me.
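For context, here is a minimal, self-contained sketch (plain PyTorch, not the actual Llama/transformers code) of what I mean by the two masking styles, and why the binary version doesn’t zero out the attention weights in my tests. The shapes and names are just for illustration:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores, i.e. q @ k.T / sqrt(d)

# Additive causal mask: 0 where attention is allowed, a large negative
# number (close to -inf) where it is not. It is *added* to the scores
# before the softmax, so masked positions get a weight of ~0.
additive_mask = torch.triu(
    torch.full((seq_len, seq_len), torch.finfo(scores.dtype).min), diagonal=1
)
weights_additive = F.softmax(scores + additive_mask, dim=-1)

# Binary causal mask: 1 where attention is allowed, 0 where it is not.
# Multiplying the scores by it does NOT zero the attention weights,
# because softmax of a 0 score is still a positive probability.
binary_mask = torch.tril(torch.ones(seq_len, seq_len))
weights_binary = F.softmax(scores * binary_mask, dim=-1)

print(weights_additive)  # upper triangle is exactly 0
print(weights_binary)    # upper triangle is NOT 0
```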
Thanks in advance