The link shows Francois Chollet's GitHub implementation of the TransformerDecoder for a Seq2Seq model that translates English phrases to Spanish ones.
The problem is that the call method of TransformerDecoder applies padding_mask to the key/value sequence returned by the English TransformerEncoder, even though padding_mask is derived from the mask produced by the Spanish embedding layer.
I would have expected the padding mask from the English embedding layer to be used here as well.
def call(self, inputs, encoder_outputs, mask=None):
    # `mask` is the padding mask propagated from the target (Spanish) embedding layer.
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        # Combine the target-side padding mask with the causal mask.
        padding_mask = tf.minimum(padding_mask, causal_mask)
    else:
        padding_mask = mask
    # Self-attention over the target sequence, restricted by the causal mask.
    attention_output_1 = self.attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
    )
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    # Cross-attention: keys/values come from the English encoder, but the
    # attention_mask is the padding_mask built from the Spanish-side mask above.
    attention_output_2 = self.attention_2(
        query=attention_output_1,
        value=encoder_outputs,
        key=encoder_outputs,
        attention_mask=padding_mask,
    )
    attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    return self.layernorm_3(attention_output_2 + proj_output)
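
To make my expectation concrete, here is a minimal sketch of the variant I had in mind, assuming the English padding mask were passed to the decoder as an extra argument (encoder_mask is a hypothetical parameter, not part of Chollet's code):

def call(self, inputs, encoder_outputs, mask=None, encoder_mask=None):
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
        # Target (Spanish) padding mask, combined with causality for self-attention.
        target_padding = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        causal_mask = tf.minimum(target_padding, causal_mask)
    if encoder_mask is not None:
        # Hypothetical: source (English) padding mask, broadcast over query positions,
        # so padded encoder positions are ignored in cross-attention.
        cross_mask = tf.cast(encoder_mask[:, tf.newaxis, :], dtype="int32")
    else:
        cross_mask = None
    attention_output_1 = self.attention_1(
        query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
    )
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    attention_output_2 = self.attention_2(
        query=attention_output_1,
        value=encoder_outputs,
        key=encoder_outputs,
        attention_mask=cross_mask,
    )
    attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    return self.layernorm_3(attention_output_2 + proj_output)

This is only meant to illustrate the question, not to claim the original is wrong.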