Difference Between Attention Mask and Causal Mask

Attention mask: the attention mask is used in decoder-based transformers so that a token only attends to its previous (left) context, not the context to its right.
I have observed the same behavior with the causal mask. What is the difference between the attention mask and the causal mask that is used to mask look-ahead tokens during the decode phase?

Hi,

Both serve the same purpose, namely masking out (leaving out) tokens which shouldn’t participate in the attention computations.

  • The attention_mask is mainly used to mask out padding tokens, or other special tokens which one doesn’t want to include in the attention computations (padding tokens, for instance, only serve to make all sequences the same length so that sentences can be batched together for training). In the Transformers library, the tokenizer automatically creates the attention_mask for you, and it’s passed to the model as another input besides input_ids (see the first sketch after this list).
  • The causal mask serves the same purpose, but is only used by decoder-only models (and the decoder part of encoder-decoder models) to ensure that the future is masked: the attention computation for a given token must not depend on tokens that come after it. This is what trains the model to predict the next token without any information being “leaked”. In the Transformers library, this is all taken care of by the model itself; users don’t need to do anything to ensure a causal mask is used (the second sketch after this list illustrates what such a mask looks like).
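
Here is a minimal sketch of the first point, assuming a GPT-2 checkpoint purely for illustration (GPT-2 has no padding token by default, so the EOS token is reused as the pad token here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["Hello world", "A somewhat longer sentence in the same batch"],
    padding=True,          # pad the shorter sequence up to the longest one
    return_tensors="pt",
)

print(batch["input_ids"])
print(batch["attention_mask"])  # 1 = real token, 0 = padding token to be ignored
```

Both `input_ids` and `attention_mask` are then passed to the model together, e.g. `model(**batch)`.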
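
And here is a small sketch of what a causal mask looks like, built by hand with PyTorch just to illustrate the idea (the Transformers models construct an equivalent mask internally, so you never write this yourself):

```python
import torch

seq_len = 5

# Lower-triangular boolean matrix: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

# In the attention computation, disallowed (future) positions are filled with -inf
# before the softmax, so they receive zero attention weight.
scores = torch.randn(seq_len, seq_len)                    # dummy attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)  # every entry above the diagonal is zero: no attention to the future
```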