I am training an NLP model autoregressively with a transformer. During training we need to use a target mask that hides future tokens and prevents the attention from cheating (i.e., from looking ahead at the positions it is supposed to predict).
In the available implementations I see that every layer of the decoder applies that target mask, not just the first one.
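For example, this is roughly the pattern I see (a minimal sketch using PyTorch's nn.TransformerDecoder; the sizes are just illustrative):

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 6
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)
decoder = nn.TransformerDecoder(decoder_layer, num_layers)

tgt = torch.rand(10, 32, d_model)     # (target_len, batch, d_model)
memory = torch.rand(20, 32, d_model)  # encoder output

# Causal ("target") mask: -inf above the diagonal hides future positions.
tgt_len = tgt.size(0)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

# Internally, the SAME tgt_mask is passed to the self-attention of
# every one of the num_layers decoder layers, not only the first one.
out = decoder(tgt, memory, tgt_mask=tgt_mask)
```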
My question is: why do we need to apply the mask in all the layers and not only in the first one? Is there something I am missing?
Basically, if we mask the target in the first layer, the other decoder layers will never see the future positions anyway, so what would prevent us from dropping the mask there? Something like the variant sketched below.
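To make the question concrete, this is the hypothetical variant I have in mind, continuing from the sketch above (the manual loop over layers is my own illustration, not something from an existing implementation):

```python
# Hypothetical: apply the causal mask only in the first decoder layer.
layers = nn.ModuleList(
    [nn.TransformerDecoderLayer(d_model, nhead) for _ in range(num_layers)]
)

x = tgt
for i, layer in enumerate(layers):
    # First layer gets the target mask; the remaining layers run their
    # self-attention without any mask over the already-masked outputs.
    x = layer(x, memory, tgt_mask=tgt_mask if i == 0 else None)
```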