Quick question on attention masking in transformer models

I have been trying to understand whether both a causal mask and an attention mask are required during text-generation inference.

Here is my reasoning.

Consider the forward function in the Llama modeling code: during both training and inference, the _update_causal_mask function is called.

I am using SDPA attention and a DynamicCache, so the _ignore_causal_mask_sdpa function is invoked in turn.
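
For reference, this is roughly my setup (a minimal sketch; the checkpoint name is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="sdpa")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
cache = DynamicCache()

with torch.no_grad():
    # Prefill: the full prompt goes through the model once and fills the cache.
    out = model(**inputs, past_key_values=cache, use_cache=True)
    next_token = out.logits[:, -1:].argmax(dim=-1)

    # Decode step: only the new token is fed in; keys/values come from the cache.
    out = model(input_ids=next_token, past_key_values=cache, use_cache=True)
```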

During inference, the condition inside _ignore_causal_mask_sdpa is satisfied for every sequentially generated token, so the causal mask is always None. During training, however, the condition does not hold, so a causal mask does get created.
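
To check my understanding of why dropping the mask is safe in that case, here is a toy example with torch.nn.functional.scaled_dot_product_attention (not the actual Transformers code): with a single query token and a KV cache, the new token is allowed to attend to every cached position, so an all-ones mask and no mask at all give the same result.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy shapes: batch=1, heads=2, head_dim=8.
# Incremental decoding: one new query token, keys/values cover
# 5 cached positions plus the new one.
q = torch.randn(1, 2, 1, 8)
k = torch.randn(1, 2, 6, 8)
v = torch.randn(1, 2, 6, 8)

# Explicit "causal" mask for this step: the new token may attend to everything.
mask = torch.ones(1, 1, 1, 6, dtype=torch.bool)

with_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
no_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=None)

print(torch.allclose(with_mask, no_mask))  # True: the mask is a no-op here
```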

And it is this causal mask (or None) that is then passed downstream to the attention layers as the attention mask.

This would mean that, for batched inference, the mask would still be None.
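
For question 2 below, this is the kind of batched call I have in mind (a sketch; the checkpoint and prompts are placeholders). The tokenizer produces an attention_mask with zeros over the left padding, and I would like to understand where that mask ends up in the attention computation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="sdpa")

prompts = ["The capital of France is", "Attention masks are used to"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# inputs.attention_mask has 0s over the padded positions, so it is not all ones.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```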

Questions:

  1. Does this mean the causal mask will always be None during inference?
  2. How are the attention masks used during batched inference?

Any insights into these would be very helpful. Thanks!
