Decoder Causal Masking [Keras]

Link

The link points to Francois Chollet's GitHub implementation of the TransformerDecoder for a Seq2Seq model that translates English phrases to Spanish ones.

The problem is that the call method of the TransformerDecoder applies padding_mask to the cross-attention over the key/value sequence returned by the English TransformerEncoder, even though padding_mask is derived from the mask produced by the Spanish embedding layer.
I would have expected the padding mask of the English embedding layer to be used there as well.

def call(self, inputs, encoder_outputs, mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        # mask is the padding mask coming from the Spanish (target) embedding layer
        padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        padding_mask = tf.minimum(padding_mask, causal_mask)
    else:
        padding_mask = mask
    # self-attention over the Spanish inputs, restricted by the causal mask
    attention_output_1 = self.attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
    )
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    # cross-attention: keys/values are the English encoder outputs, yet the
    # attention_mask is padding_mask, which was built from the Spanish mask
    attention_output_2 = self.attention_2(
        query=attention_output_1, value=encoder_outputs, key=encoder_outputs,
        attention_mask=padding_mask,
    )
    attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    return self.layernorm_3(attention_output_2 + proj_output)
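
For reference, here is a minimal sketch of what I had expected instead (my own illustration, not code from the repository; english_mask and spanish_len are hypothetical names): a cross-attention mask built from the English source mask, so that padded English key/value positions are hidden from every Spanish query position.

import tensorflow as tf

def expected_cross_attention_mask(english_mask, spanish_len):
    # english_mask: [batch, english_len], True for real tokens, False for padding
    key_mask = tf.cast(english_mask[:, tf.newaxis, :], dtype="int32")  # [batch, 1, english_len]
    # repeat the same row for every Spanish query position -> [batch, spanish_len, english_len]
    return tf.repeat(key_mask, repeats=spanish_len, axis=1)

# toy example: English length 4 with one padded position, Spanish target length 3
english_mask = tf.constant([[True, True, True, False]])
print(expected_cross_attention_mask(english_mask, spanish_len=3))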

It seems like the padding mask issue arises from the fact that the English padding mask is being overwritten by the Spanish mask. A solution could be to combine both masks instead of replacing one with the other. I would suggest trying this modification to make sure that both the English and Spanish padding masks are applied correctly.

def call(self, inputs, encoder_outputs, mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        padding_mask = tf.cast(mask, dtype="int32")
        combined_mask = tf.minimum(padding_mask, causal_mask)
    else:
        combined_mask = causal_mask

    attention_output_1 = self.self_attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=combined_mask
    )
    attention_output_2 = self.self_attention_2(
        inputs=attention_output_1
    )
    return self.layernorm_3(attention_output_2)

This should ensure both the English and Spanish masks are respected without overriding one another.

Solution provided by Triskel Data Deterministic Ai


Your code does not seem to be doing that. It also does not use any mask for the second attention layer.

I am more interested in understanding the concept, so the solution does not have to be written in Keras; it can also be PyTorch or anything else. Finding detailed information about (causal) masks has turned out to be surprisingly hard for me.
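
To make the question concrete, here is how I currently understand the masks to be built and combined for decoder self-attention, written as a plain-TensorFlow sketch (my own, not taken from the repository); please correct me if this picture is wrong. 1 means "may attend", 0 means "blocked".

import tensorflow as tf

def causal_mask(seq_len):
    # lower-triangular matrix: query position i may attend to key positions j <= i
    i = tf.range(seq_len)[:, tf.newaxis]
    j = tf.range(seq_len)
    return tf.cast(i >= j, dtype="int32")

def decoder_self_attention_mask(target_padding_mask):
    # target_padding_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    seq_len = target_padding_mask.shape[1]
    causal = causal_mask(seq_len)[tf.newaxis, :, :]                      # [1, seq, seq]
    keys = tf.cast(target_padding_mask[:, tf.newaxis, :], dtype="int32") # [batch, 1, seq]
    # a key is visible only if it is a real token AND not in the future,
    # hence the element-wise minimum (a logical AND)
    return tf.minimum(causal, keys)                                      # [batch, seq, seq]

padding = tf.constant([[1, 1, 1, 0, 0]])  # last two positions are padding
print(decoder_self_attention_mask(padding))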


Hmm, that’s a difficult one… Even if it isn’t an exact answer to your question, perhaps it is still useful, since it touches on how padding and causal masking are applied:
https://ai.stackexchange.com/questions/42116/transformer-decoder-causal-masking-during-inference
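
If it helps, here is a tiny illustration (my own, not taken from the linked answer) of the point that thread discusses: when decoding one token at a time at inference, the newest position corresponds to the last row of the causal mask, and that row allows attention to the entire prefix, so the causal mask is effectively a no-op for that single query.

import tensorflow as tf

seq_len = 5
i = tf.range(seq_len)[:, tf.newaxis]
j = tf.range(seq_len)
causal = tf.cast(i >= j, dtype="int32")   # training-time causal mask, [seq, seq]
print(causal[-1].numpy())                 # [1 1 1 1 1]: the newest token sees the whole prefix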