The (hidden) meaning behind the embedding of the padding token?

So I noticed that transformer models produce different embeddings for PAD tokens, and I know pad tokens are typically just ignored for the most part (if present at all). However, since a forward pass over a batch typically contains dozens of padding tokens, it would be interesting to see whether these in fact hold any meaningful information (as the padding tokens do attend to the rest of the sequence). Does anyone know of any research on what information might be present here?

One might legitimately ask why this is relevant: aren't padding tokens simply a convenience for efficient processing, since we need a uniform tensor shape? That is naturally correct, but quite a few studies cluster sentence embeddings, and it seems relevant to ask what influence the padding embeddings have on this (see the pooling sketch after the demonstration below).

For a short demonstration that they indeed have different embeddings:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "bert-base-uncased")
model = transformers.BertModel.from_pretrained(
    "bert-base-uncased")

input_ = tokenizer(["this is a sample sentence"], return_tensors="pt",
                   # add some padding
                   padding="max_length", max_length=128, truncation=True)
output = model(**input_)

# extract the padding token embeddings
pad_positions = [i for i, t in enumerate(input_["input_ids"][0])
                 if t == tokenizer.pad_token_id]    # positions of the [PAD] tokens
embedding_pad1 = output[0][0][pad_positions[0]]     # last hidden state at the first pad
embedding_pad2 = output[0][0][pad_positions[1]]     # last hidden state at the second pad

embedding_pad1.shape  # embedding size (768 for bert-base)
embedding_pad1[0:10]
# tensor([-0.5072, -0.4916, -0.1021, -0.1485, -0.4096,  0.0536, -0.1111,  0.0525,
#         -0.0748, -0.4794], grad_fn=<SliceBackward>)
embedding_pad2[0:10]
# tensor([-0.6447, -0.5780, -0.1062, -0.1869, -0.3671,  0.0763, -0.0486,  0.0202,
#         -0.1334, -0.5716], grad_fn=<SliceBackward>)
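
To make the clustering concern concrete, here is a minimal sketch (continuing from the snippet above, so it reuses input_ and output) of how the pad embeddings end up in a mean-pooled sentence embedding unless the attention mask is used to exclude them. The mean-pooling choice here is just an assumption for illustration, not something any particular library does by default:

import torch

# continuing from the snippet above (reuses `input_` and `output`)
hidden = output[0]                             # last hidden states, shape (1, 128, 768)
mask = input_["attention_mask"].unsqueeze(-1)  # (1, 128, 1); 1 = real token, 0 = pad

# naive mean pooling over all 128 positions: the pad embeddings are averaged in
naive_mean = hidden.mean(dim=1)

# masked mean pooling over the real tokens only
masked_mean = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# the two sentence embeddings differ, so any downstream clustering could too
torch.dist(naive_mean, masked_mean)  # clearly non-zero

Whether this matters in practice of course depends on how the sentence embedding is pooled (e.g. CLS token vs. mean over all positions).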

@KennethEnevoldsen I was thinking about the same thing a while ago.
You have a point about the different embeddings for pad tokens. But to my understanding these never interfere with any other part of the model's computation (e.g. self-attention), since the pad tokens are always masked out via the attention mask.
Would you have an example of where the pad token embeddings could make a difference, given the attention mask?
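
For reference, a quick sanity check of that point (a minimal sketch, again assuming bert-base-uncased): encode the same sentence with and without padding and compare the hidden states of the real tokens.

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = ["this is a sample sentence"]
no_pad = tokenizer(text, return_tensors="pt")
padded = tokenizer(text, return_tensors="pt",
                   padding="max_length", max_length=128, truncation=True)

with torch.no_grad():
    h_no_pad = model(**no_pad)[0]          # (1, n, 768), n = number of real tokens
    n = no_pad["input_ids"].shape[1]
    h_padded = model(**padded)[0][:, :n]   # the same real-token positions in the padded run

# should be tiny (floating-point noise only): the attention mask stops the real
# tokens from attending to the pads, so their outputs are essentially unchanged
(h_no_pad - h_padded).abs().max()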
