Difference Between Attention Mask and Causal Mask

Attention mask: the attention mask is used in decoder-based transformers so that a token only attends to its previous (left) context, not the context to its right.
I have observed the same behavior with the causal mask. What is the difference between the attention mask and the causal mask that is used to mask look-ahead tokens during the decode phase?

Hi,

Both serve the same purpose, namely masking out (leaving out) tokens which shouldn’t participate in the attention computations.

  • The attention_mask is mainly used to mask out padding tokens, or other special tokens which one doesn’t want to include in the attention computations (padding tokens, for instance, only serve to make all sequences the same length so that sentences can be batched together for training). In the Transformers library, the tokenizer automatically creates the attention_mask for you, and it’s passed to the model as another input besides input_ids (see the first sketch after this list).
  • The causal mask serves the same purpose, but is only used by decoder-only models (and the decoder part of encoder-decoder models) to ensure that the future is masked: the attention computation for a given token must not depend on tokens that come after it. This is what trains the model to predict the next token without any information being “leaked”. In the Transformers library, this is all taken care of by the model itself; users don’t need to do anything to ensure a causal mask is used (the second sketch after this list illustrates what such a mask looks like).
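
Here is a minimal sketch of the first point, assuming a GPT-2 checkpoint purely for illustration (GPT-2 has no padding token by default, so the EOS token is reused as the pad token here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["Hello world", "A somewhat longer sentence in the same batch"],
    padding=True,          # pad the shorter sequence up to the longest one
    return_tensors="pt",
)

print(batch["input_ids"])
print(batch["attention_mask"])  # 1 = real token, 0 = padding token to be ignored
```

Both `input_ids` and `attention_mask` are then passed to the model together, e.g. `model(**batch)`.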
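
And here is a small sketch of what a causal mask looks like, built by hand with PyTorch just to illustrate the idea (the Transformers models construct an equivalent mask internally, so you never write this yourself):

```python
import torch

seq_len = 5

# Lower-triangular boolean matrix: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

# In the attention computation, disallowed (future) positions are filled with -inf
# before the softmax, so they receive zero attention weight.
scores = torch.randn(seq_len, seq_len)                    # dummy attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)  # every entry above the diagonal is zero: no attention to the future
```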