The (hidden) meaning behind the embedding of the padding token?

So I noticed that transformers produce different embeddings for PAD tokens, and I know pad tokens are typically simply ignored (if present at all). However, as a forward pass over a batch typically contains dozens of padding tokens, it would be interesting to see whether these in fact hold any meaningful information (the padding positions do attend to the rest of the sequence). Does anyone know of any research on what information might be present here?

One might legitimately ask why this is relevant: aren't padding tokens simply a convenience for efficient processing, since we need tensors of the same shape? This is naturally correct, but quite a few studies have clustered sentence embeddings, and it seems relevant to ask what influence the padding embeddings have on this.
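To make the clustering concern concrete, here is a minimal sketch (plain PyTorch with random stand-in tensors, not real model outputs) of how naive mean pooling lets padding vectors leak into a sentence embedding, while masked mean pooling excludes them:

```python
import torch

# Hypothetical batch: 2 sequences, max length 6, hidden size 4.
# The second sequence has only 3 real tokens and 3 pad positions.
hidden = torch.randn(2, 6, 4)               # [b, t, f] model outputs
mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0, 0]])   # 1 = real token, 0 = padding

# Naive mean pooling averages the pad-position vectors into the result
naive = hidden.mean(dim=1)

# Masked mean pooling averages over the real tokens only
m = mask.unsqueeze(-1).float()              # [b, t, 1]
masked = (hidden * m).sum(dim=1) / m.sum(dim=1)

# For the fully-real sequence the two agree; for the padded one they differ,
# so whatever the pad positions encode ends up in the sentence embedding
print(torch.allclose(naive[0], masked[0]))  # True
print(torch.allclose(naive[1], masked[1]))  # False
```

So if a pipeline pools hidden states without applying the attention mask, the pad embeddings directly shape the vectors being clustered.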

For a short demonstration that they indeed have different embeddings:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "bert-base-uncased")
model = transformers.BertModel.from_pretrained(
    "bert-base-uncased")

input_ = tokenizer(["this is a sample sentence"], return_tensors="pt",
                   # add some padding
                   padding="max_length", max_length=128, truncation=True)
output = model(**input_)

# extract the hidden states at two padding positions
# (use tokenizer.pad_token_id rather than hard-coding 0)
pad_tok_id = [i for i, t in enumerate(input_["input_ids"][0])
              if t == tokenizer.pad_token_id]
embedding_pad1 = output[0][0][pad_tok_id[0]]  # output[0] is the last hidden state
embedding_pad2 = output[0][0][pad_tok_id[1]]

embedding_pad1.shape  # embedding size

embedding_pad1[0:10]
# tensor([-0.5072, -0.4916, -0.1021, -0.1485, -0.4096,  0.0536, -0.1111,  0.0525,
#         -0.0748, -0.4794], grad_fn=<SliceBackward>)

embedding_pad2[0:10]
# tensor([-0.6447, -0.5780, -0.1062, -0.1869, -0.3671,  0.0763, -0.0486,  0.0202,
#         -0.1334, -0.5716], grad_fn=<SliceBackward>)

@KennethEnevoldsen I was thinking about the same thing a while ago.
You have a point about the pad tokens having different embeddings. But to my understanding these never interfere with any part of the model's computation (e.g. self-attention), since the pad tokens are always masked out via the attention mask.
Would you have an example of where the pad token embeddings could make a difference, given the attention mask?
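To make the masking mechanics concrete, here is a minimal sketch of scaled dot-product attention with a key mask (plain PyTorch, hypothetical tensors, not the actual BERT code). The mask removes the pad *keys*, so no position attends to padding, but the pad positions' own query rows are still computed:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
t, d = 5, 8                          # sequence length, head dimension
q, k, v = (torch.randn(t, d) for _ in range(3))

# attention mask in the usual BERT sense: the last two positions are padding
mask = torch.tensor([1, 1, 1, 0, 0])

scores = q @ k.T / d ** 0.5
# the mask is applied over the keys: nobody may attend TO padding ...
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)     # columns 3 and 4 are exactly zero
out = attn @ v

# ... but rows 3 and 4 of `out` (the pad positions) are still computed, as
# attention-weighted mixtures of the REAL tokens -- which is one reason the
# pad-position hidden states can end up carrying sentence information
```

So as long as downstream code only reads the non-pad positions, the pad outputs are indeed inert; they only matter if something (e.g. unmasked pooling) reads them back in.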


Hello,

This discussion sounds interesting to me because I was wondering the same thing: why are there different embedding vectors for the PAD tokens?

My use case is multi-label text classification, where I use a model pretrained with masked language modeling as an "embedding layer". More specifically, I feed the padded input text [b, t] to the "embedding layer" and it outputs [b, t, f], where b is the batch size, t is the length of the longest sequence in the batch, and f is the number of features.

After this I apply attention over [b, t, f] and take a vector [b, 1, f], which, after passing it through two linear layers and a sigmoid, gives the predictions.
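A sketch of a head like the one described above (the class and parameter names here are made up for illustration; the actual implementation may differ):

```python
import torch
import torch.nn as nn

class AttnPoolClassifier(nn.Module):
    """Illustrative: attention-pool [b, t, f] down to [b, f],
    then two linear layers and a sigmoid for multi-label output."""
    def __init__(self, f: int, hidden: int, n_labels: int):
        super().__init__()
        self.scorer = nn.Linear(f, 1)          # per-token attention score
        self.fc1 = nn.Linear(f, hidden)
        self.fc2 = nn.Linear(hidden, n_labels)

    def forward(self, x, mask):                # x: [b, t, f], mask: [b, t]
        scores = self.scorer(x).squeeze(-1)    # [b, t]
        scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)   # pad positions get weight 0
        pooled = (attn.unsqueeze(-1) * x).sum(dim=1)   # [b, f]
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(pooled))))

head = AttnPoolClassifier(f=16, hidden=32, n_labels=5)
x = torch.randn(2, 6, 16)                      # hypothetical embeddings
mask = torch.tensor([[1] * 6, [1] * 3 + [0] * 3])
probs = head(x, mask)                          # [2, 5], each entry in (0, 1)
```

Note that if the pooling attention is masked like this, the pad embeddings cannot influence the prediction at all; if it is unmasked, they can only be down-weighted, not excluded.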

I checked the cosine similarity between the embedding vectors of the PAD tokens, and it is above 0.7 for almost all pairs. Additionally, the cosine similarity between word embedding vectors and PAD token vectors is below 0.3 for almost all pairs.

The attention mechanism seems to assign negligible weights to the PAD token embedding vectors.

In general these vectors seem to be more or less ignored by the model. Furthermore, my results are pretty OK with respect to accuracy.
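For anyone wanting to reproduce that similarity check, here is a sketch of the computation (plain PyTorch with random stand-in tensors; the >0.7 / <0.3 figures above were observed on real model outputs, not on this toy data):

```python
import torch
import torch.nn.functional as F

def pairwise_cos(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

# hypothetical hidden states [t, f] for one sequence, split by the mask
hidden = torch.randn(10, 16)
mask = torch.tensor([1] * 6 + [0] * 4)
word_vecs = hidden[mask == 1]                # [6, 16]
pad_vecs = hidden[mask == 0]                 # [4, 16]

pad_pad = pairwise_cos(pad_vecs, pad_vecs)   # pad-vs-pad similarities [4, 4]
word_pad = pairwise_cos(word_vecs, pad_vecs) # word-vs-pad similarities [6, 4]
```

High pad-vs-pad similarity with low word-vs-pad similarity would suggest the pad positions collapse into their own tight cluster, largely separate from the content tokens.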