Decoder Causal Masking [Keras]

Thanks. The blog you posted explains how causal masks work for self-attention, which I already understood. My question is what the mask looks like for the attention head that combines the encoder output with the decoder (the cross-attention block).
I have my guesses about how this should work, but I would like to see an explicit explanation of this specific point. A sketch of my current guess is below.
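To make the question concrete, here is a rough sketch of my guess (the shapes, names, and mask construction are my own assumptions, not something from the blog): in cross-attention the queries come from the decoder and the keys/values from the encoder output, so the mask would have shape `(batch, dec_len, enc_len)` and only hide padded encoder positions, with no causal (triangular) structure, since every decoder step may look at the full source sequence.

```python
# Sketch of my guess for the cross-attention mask, assuming padding-only masking.
import tensorflow as tf

batch, dec_len, enc_len, d_model = 2, 5, 7, 16

decoder_hidden = tf.random.normal((batch, dec_len, d_model))   # queries
encoder_output = tf.random.normal((batch, enc_len, d_model))   # keys / values

# 1 for real source tokens, 0 for padding (toy example).
enc_padding = tf.constant([[1, 1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 0]], dtype=tf.int32)

# Broadcast to (batch, dec_len, enc_len): each decoder position may attend
# to every non-padded encoder position -- no triangular structure here.
cross_mask = tf.cast(enc_padding[:, tf.newaxis, :], tf.bool)
cross_mask = tf.broadcast_to(cross_mask, (batch, dec_len, enc_len))

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)
out = mha(query=decoder_hidden, value=encoder_output, key=encoder_output,
          attention_mask=cross_mask)
print(out.shape)  # (2, 5, 16)
```

Is this padding-only mask the right picture, or does the causal mask also play a role in the cross-attention head?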
