Decoder Causal Masking [Keras]

Your code does not seem to be doing that. It also does not use any mask for the second attention layer.

I am more interested in understanding the concept, so the solution does not have to be written in Keras. It can also be PyTorch or anything else. Finding detailed information about (causal) masks turned out to be surprisingly hard for me.
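To make the concept concrete, here is a minimal framework-free sketch of what a causal mask does to attention weights. It is not taken from the code discussed above. The helper names `causal_mask` and `masked_softmax` are made up for illustration, and the scores are toy values rather than real query-key products.

```python
import math

def causal_mask(seq_len):
    """Return a seq_len x seq_len mask; True means query i may attend to key j."""
    # Position i may only look at positions j <= i (itself and the past).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Replace disallowed scores with -inf so that exp() maps them to 0.
        masked = [s if keep else -math.inf for s, keep in zip(row_scores, row_mask)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example: 4 positions with identical raw attention scores.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

Each row is one query position: the first position can only attend to itself, the second splits its weight over the first two positions, and so on, so no position ever receives information from the future. In the standard Transformer decoder this mask is applied to the self-attention layer. The second (encoder-decoder) attention layer usually does not need a causal mask, only a padding mask for the encoder inputs if the batch is padded.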
