The (hidden) meaning behind the embedding of the padding token?

@KennethEnevoldsen I was thinking about the same thing a while ago.
You have a point about pad tokens getting different embeddings. But to my understanding, these never interfere with any part of the model's computation (e.g., self-attention), since the pad tokens are always masked out by the attention mask.
Would you have an example of where the pad token embeddings could make a difference, given the attention mask?
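
For what it's worth, here is a minimal sanity-check sketch of what I mean, assuming a BERT-style encoder loaded via Hugging Face transformers (`bert-base-uncased` is just a placeholder): it perturbs the pad token embedding and compares the hidden states at the non-pad positions before and after.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# One short sentence, padded to a fixed length so pad tokens are present.
enc = tokenizer("hello world", padding="max_length", max_length=8, return_tensors="pt")

with torch.no_grad():
    out_before = model(**enc).last_hidden_state

    # Overwrite the pad token embedding with random values.
    pad_id = tokenizer.pad_token_id
    emb = model.get_input_embeddings().weight
    emb[pad_id] = torch.randn_like(emb[pad_id])

    out_after = model(**enc).last_hidden_state

# Compare only the real (non-pad) positions.
real = enc["attention_mask"].bool()
print(torch.allclose(out_before[real], out_after[real], atol=1e-4))
```

In my understanding this should print `True` (up to the tiny numerical residue from the additive attention mask), because the masked pad positions contribute (almost) nothing to the attention outputs of the real tokens. So I'm curious in which setting the pad embeddings would actually matter.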
