In the first stage of BartDecoder, we:
- compute the token embedding
- add the positional embedding
- apply layer normalization
- apply dropout (optional)
```python
x = self.embed_tokens(input_ids)  # token embedding
x = x + positions                 # add (precomputed) positional embedding
x = self.layernorm(x)             # layer normalization
x = F.dropout(x, p, training=self.training)  # dropout
```
I am thinking of moving the dropout to right before the positional embedding is added, so that only the token embedding is made noisy:
```python
x = self.embed_tokens(input_ids)             # token embedding
x = F.dropout(x, p, training=self.training)  # dropout on the token embedding only
x = x + positions                            # add positional embedding
x = self.layernorm(x)                        # layer normalization
```
Is there any reason to believe that dropout needs to be placed after layer normalization?
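
For anyone who wants to experiment with the two orderings, here is a minimal, self-contained sketch. The `EmbeddingStage` class, the `dropout_first` flag, and the use of learned positional embeddings are my own assumptions for illustration, not the actual BartDecoder code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingStage(nn.Module):
    """BART-style decoder embedding stage with a switch between the two
    dropout placements discussed above. Hypothetical sketch, not the
    actual BartDecoder implementation."""

    def __init__(self, vocab_size, max_positions, d_model, p=0.1,
                 dropout_first=False):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        # assumption: learned positional embeddings
        self.embed_positions = nn.Embedding(max_positions, d_model)
        self.layernorm = nn.LayerNorm(d_model)
        self.p = p
        self.dropout_first = dropout_first  # True = proposed ordering

    def forward(self, input_ids):
        positions = self.embed_positions(
            torch.arange(input_ids.size(1), device=input_ids.device)
        )
        x = self.embed_tokens(input_ids)
        if self.dropout_first:
            # proposed: make only the token embedding noisy
            x = F.dropout(x, p=self.p, training=self.training)
            x = x + positions
            x = self.layernorm(x)
        else:
            # current: dropout after layer normalization
            x = x + positions
            x = self.layernorm(x)
            x = F.dropout(x, p=self.p, training=self.training)
        return x


# quick check that both variants run
stage = EmbeddingStage(vocab_size=100, max_positions=32, d_model=16,
                       dropout_first=True)
out = stage(torch.randint(0, 100, (2, 8)))
print(out.shape)  # torch.Size([2, 8, 16])
```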