It seems the padding mask issue arises because the English (source) padding mask is overwritten by the Spanish (target) mask: the target padding mask is merged into the causal mask in place, and that same variable is then reused as the cross-attention mask, where the English mask should apply. A solution is to keep the masks separate and combine them per step instead of replacing one with the other. I would suggest trying this modification so that both the English and Spanish padding masks are applied correctly:
def call(self, inputs, encoder_outputs, mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        # Spanish (target) padding mask, broadcast to (batch, 1, seq_len) so it
        # combines element-wise with the (batch, seq_len, seq_len) causal mask.
        padding_mask = tf.cast(mask[:, None, :], dtype="int32")
        combined_mask = tf.minimum(padding_mask, causal_mask)
    else:
        combined_mask = causal_mask
    # Self-attention over the Spanish tokens, using the padding + causal masks.
    attention_output_1 = self.self_attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=combined_mask
    )
    out_1 = self.layernorm_1(inputs + attention_output_1)
    # The English (source) padding mask is propagated by Keras on the encoder
    # outputs; fall back to no mask if it was not attached.
    encoder_mask = getattr(encoder_outputs, "_keras_mask", None)
    if encoder_mask is not None:
        encoder_mask = tf.cast(encoder_mask[:, None, :], dtype="int32")
    # Cross-attention: Spanish queries attend over the English encoder outputs,
    # masked by the English padding mask rather than the Spanish one.
    attention_output_2 = self.self_attention_2(
        query=out_1,
        value=encoder_outputs,
        key=encoder_outputs,
        attention_mask=encoder_mask,
    )
    return self.layernorm_3(out_1 + attention_output_2)
With this change, self-attention respects the combined Spanish padding and causal masks, while cross-attention respects the English padding mask, so neither overrides the other. Note two assumptions here: that your decoder class has a layernorm_1 sublayer for the residual connection (as in the Keras translation example this code appears to come from), and that Keras mask propagation has attached the English mask to encoder_outputs as _keras_mask, which happens automatically when the encoder's embedding layer implements masking (e.g. mask_zero=True or a compute_mask method).
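If you want to sanity-check the mask combination in isolation, here is a minimal standalone sketch; the batch size, sequence length, and mask values are made up for illustration:

import tensorflow as tf

# Toy example: batch of 1, seq_len of 4, last position is padding.
mask = tf.constant([[1, 1, 1, 0]])                                 # Spanish padding mask
causal = tf.linalg.band_part(tf.ones((1, 4, 4), tf.int32), -1, 0)  # lower-triangular causal mask
padding = tf.cast(mask[:, None, :], dtype="int32")                 # shape (1, 1, 4)
combined = tf.minimum(padding, causal)                             # shape (1, 4, 4)
print(combined.numpy())
# Each query row i can attend to positions <= i, except the padded
# position 3, whose column is zeroed out for every query.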