The (hidden) meaning behind the embedding of the padding token?

@KennethEnevoldsen I was thinking about the same thing a while ago.
You have a point about pad tokens getting different embeddings. But to my understanding, these never interfere with any part of the model's computation (e.g., self-attention), since the pad tokens are always masked out by the attention mask.
Would you have an example of where the pad token embeddings could make a difference, given the attention mask?
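
For what it's worth, here is a minimal sanity-check sketch of what I mean, assuming a BERT-style encoder loaded via Hugging Face transformers (`bert-base-uncased` is just a placeholder): it perturbs the pad token embedding and compares the hidden states at the non-pad positions before and after.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# One short sentence, padded to a fixed length so pad tokens are present.
enc = tokenizer("hello world", padding="max_length", max_length=8, return_tensors="pt")

with torch.no_grad():
    out_before = model(**enc).last_hidden_state

    # Overwrite the pad token embedding with random values.
    pad_id = tokenizer.pad_token_id
    emb = model.get_input_embeddings().weight
    emb[pad_id] = torch.randn_like(emb[pad_id])

    out_after = model(**enc).last_hidden_state

# Compare only the real (non-pad) positions.
real = enc["attention_mask"].bool()
print(torch.allclose(out_before[real], out_after[real], atol=1e-4))
```

In my understanding this should print `True` (up to the tiny numerical residue from the additive attention mask), because the masked pad positions contribute (almost) nothing to the attention outputs of the real tokens. So I'm curious in which setting the pad embeddings would actually matter.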
