Decoder Causal Masking [Keras]

Link

The link points to Francois Chollet's GitHub implementation of the TransformerDecoder for a Seq2Seq model that translates English phrases to Spanish ones.

The problem is that the call method of the TransformerDecoder applies padding_mask to the cross-attention over the key/value sequence returned by the English TransformerEncoder, even though padding_mask is derived from the mask produced by the Spanish embedding layer.
I would have expected the padding mask of the English embedding layer to be used there as well.

def call(self, inputs, encoder_outputs, mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        # mask is the padding mask coming from the Spanish (target) embedding layer
        padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        padding_mask = tf.minimum(padding_mask, causal_mask)
    else:
        padding_mask = mask
    # self-attention over the Spanish inputs, restricted by the causal mask
    attention_output_1 = self.attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
    )
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    # cross-attention: keys/values are the English encoder outputs, yet the
    # attention_mask is padding_mask, which was built from the Spanish mask
    attention_output_2 = self.attention_2(
        query=attention_output_1, value=encoder_outputs, key=encoder_outputs,
        attention_mask=padding_mask,
    )
    attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    return self.layernorm_3(attention_output_2 + proj_output)
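
For reference, here is a minimal sketch of what I had expected instead (my own illustration, not code from the repository; english_mask and spanish_len are hypothetical names): a cross-attention mask built from the English source mask, so that padded English key/value positions are hidden from every Spanish query position.

import tensorflow as tf

def expected_cross_attention_mask(english_mask, spanish_len):
    # english_mask: [batch, english_len], True for real tokens, False for padding
    key_mask = tf.cast(english_mask[:, tf.newaxis, :], dtype="int32")  # [batch, 1, english_len]
    # repeat the same row for every Spanish query position -> [batch, spanish_len, english_len]
    return tf.repeat(key_mask, repeats=spanish_len, axis=1)

# toy example: English length 4 with one padded position, Spanish target length 3
english_mask = tf.constant([[True, True, True, False]])
print(expected_cross_attention_mask(english_mask, spanish_len=3))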

It seems like the padding mask issue arises from the fact that the English padding mask is being overwritten by the Spanish mask. A solution could be to combine both masks instead of replacing one with the other. I would suggest trying this modification to make sure that both the English and Spanish padding masks are applied correctly.

def call(self, inputs, encoder_outputs, mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        padding_mask = tf.cast(mask, dtype="int32")
        combined_mask = tf.minimum(padding_mask, causal_mask)
    else:
        combined_mask = causal_mask

    attention_output_1 = self.self_attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=combined_mask
    )
    attention_output_2 = self.self_attention_2(
        inputs=attention_output_1
    )
    return self.layernorm_3(attention_output_2)

This should ensure both the English and Spanish masks are respected without overriding one another.

Solution provided by Triskel Data Deterministic Ai


Your code does not seem to be doing that. It also does not use any mask for the second attention layer.

I am more interested in understanding the concept, so the solution does not have to be written in Keras; it can also be PyTorch or anything else. Finding detailed information about (causal) masks has turned out to be surprisingly hard for me.
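
To make the question concrete, here is how I currently understand the masks to be built and combined for decoder self-attention, written as a plain-TensorFlow sketch (my own, not taken from the repository); please correct me if this picture is wrong. 1 means "may attend", 0 means "blocked".

import tensorflow as tf

def causal_mask(seq_len):
    # lower-triangular matrix: query position i may attend to key positions j <= i
    i = tf.range(seq_len)[:, tf.newaxis]
    j = tf.range(seq_len)
    return tf.cast(i >= j, dtype="int32")

def decoder_self_attention_mask(target_padding_mask):
    # target_padding_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    seq_len = target_padding_mask.shape[1]
    causal = causal_mask(seq_len)[tf.newaxis, :, :]                      # [1, seq, seq]
    keys = tf.cast(target_padding_mask[:, tf.newaxis, :], dtype="int32") # [batch, 1, seq]
    # a key is visible only if it is a real token AND not in the future,
    # hence the element-wise minimum (a logical AND)
    return tf.minimum(causal, keys)                                      # [batch, seq, seq]

padding = tf.constant([[1, 1, 1, 0, 0]])  # last two positions are padding
print(decoder_self_attention_mask(padding))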


Hmm, that’s a difficult one… Even if it isn’t an exact answer to your question, perhaps it is still useful, since it touches on how padding and causal masking are applied:
https://ai.stackexchange.com/questions/42116/transformer-decoder-causal-masking-during-inference
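
If it helps, here is a tiny illustration (my own, not taken from the linked answer) of the point that thread discusses: when decoding one token at a time at inference, the newest position corresponds to the last row of the causal mask, and that row allows attention to the entire prefix, so the causal mask is effectively a no-op for that single query.

import tensorflow as tf

seq_len = 5
i = tf.range(seq_len)[:, tf.newaxis]
j = tf.range(seq_len)
causal = tf.cast(i >= j, dtype="int32")   # training-time causal mask, [seq, seq]
print(causal[-1].numpy())                 # [1 1 1 1 1]: the newest token sees the whole prefix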