Decoder Causal Masking [Keras]

Thanks. The blog you posted explains how causal masks work for self-attention, which I already understood. My question is what the mask looks like for the attention head that combines the encoder output with the decoder (the cross-attention block).
I have my guesses about how this should work, but I would like to see an explicit explanation of this specific point. A sketch of my current guess is below.
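To make the question concrete, here is a rough sketch of my guess (the shapes, names, and mask construction are my own assumptions, not something from the blog): in cross-attention the queries come from the decoder and the keys/values from the encoder output, so the mask would have shape `(batch, dec_len, enc_len)` and only hide padded encoder positions, with no causal (triangular) structure, since every decoder step may look at the full source sequence.

```python
# Sketch of my guess for the cross-attention mask, assuming padding-only masking.
import tensorflow as tf

batch, dec_len, enc_len, d_model = 2, 5, 7, 16

decoder_hidden = tf.random.normal((batch, dec_len, d_model))   # queries
encoder_output = tf.random.normal((batch, enc_len, d_model))   # keys / values

# 1 for real source tokens, 0 for padding (toy example).
enc_padding = tf.constant([[1, 1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 0]], dtype=tf.int32)

# Broadcast to (batch, dec_len, enc_len): each decoder position may attend
# to every non-padded encoder position -- no triangular structure here.
cross_mask = tf.cast(enc_padding[:, tf.newaxis, :], tf.bool)
cross_mask = tf.broadcast_to(cross_mask, (batch, dec_len, enc_len))

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)
out = mha(query=decoder_hidden, value=encoder_output, key=encoder_output,
          attention_mask=cross_mask)
print(out.shape)  # (2, 5, 16)
```

Is this padding-only mask the right picture, or does the causal mask also play a role in the cross-attention head?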
