So I noticed that transformer models produce different embeddings for each PAD token. I know padding tokens are typically ignored for the most part (when present at all). However, since a forward pass over a batch usually contains dozens of padding tokens, it would be interesting to see whether they in fact hold any meaningful information, as the padding positions do attend to the rest of the sequence. Does anyone know of any research on what information might be present here?
One might legitimately ask why this is relevant: aren't padding tokens simply a convenience for efficient processing, since we need a uniform tensor shape? That is correct, but quite a few studies cluster sentence embeddings derived from the token embeddings, and it seems relevant to ask what influence the padding embeddings have on such results.
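As a concrete illustration of where this could matter (a sketch of my own, not taken from any particular study): naive mean pooling over the last hidden state averages the padding positions in together with the real tokens, whereas mask-aware pooling excludes them, and the two sentence vectors differ.

```python
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")
model.eval()

enc = tokenizer(["this is a sample sentence"], return_tensors="pt",
                padding="max_length", max_length=128, truncation=True)
with torch.no_grad():
    hidden = model(**enc).last_hidden_state  # shape (1, 128, 768)

# naive pooling: the 100+ pad embeddings contribute to the sentence vector
naive = hidden.mean(dim=1)

# mask-aware pooling: only the real tokens contribute
mask = enc["attention_mask"].unsqueeze(-1)  # (1, 128, 1)
masked = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# the two sentence embeddings differ, so the pad embeddings shift the result
print(torch.allclose(naive, masked))  # False
```

So any clustering done on naively pooled embeddings is at least partly clustering the padding embeddings too.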
For a short demonstration that they indeed have different embeddings:
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")

# pad the single sentence out to max_length
input_ = tokenizer(["this is a sample sentence"], return_tensors="pt",
                   padding="max_length", max_length=128, truncation=True)
output = model(**input_)

# find the positions of the padding tokens and extract their embeddings
pad_positions = [i for i, t in enumerate(input_["input_ids"][0])
                 if t == tokenizer.pad_token_id]
embedding_pad1 = output[0][0][pad_positions[0]]
embedding_pad2 = output[0][0][pad_positions[1]]

embedding_pad1.shape  # torch.Size([768]), the hidden size
embedding_pad1[0:10]
embedding_pad2[0:10]
tensor([-0.5072, -0.4916, -0.1021, -0.1485, -0.4096, 0.0536, -0.1111, 0.0525,
-0.0748, -0.4794], grad_fn=<SliceBackward>)
tensor([-0.6447, -0.5780, -0.1062, -0.1869, -0.3671, 0.0763, -0.0486, 0.0202,
-0.1334, -0.5716], grad_fn=<SliceBackward>)
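Going one step further (again my own sketch, not an established result): if the padding embeddings carried no sequence information, the embedding at a given pad position should be identical regardless of the sentence. Comparing the same pad position across two different inputs suggests otherwise.

```python
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")
model.eval()

def pad_embedding(sentence, position=20):
    # encode with fixed-length padding so both sentences share pad positions
    enc = tokenizer([sentence], return_tensors="pt",
                    padding="max_length", max_length=128, truncation=True)
    assert enc["input_ids"][0, position] == tokenizer.pad_token_id
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[0, position]

e1 = pad_embedding("this is a sample sentence")
e2 = pad_embedding("an entirely different input")

# same token id, same position, different context: the outputs differ,
# i.e. the pad embedding depends on the sentence it is attached to
print(torch.allclose(e1, e2))  # False
```

Which is consistent with the pad positions attending to the real tokens: the attention mask stops real tokens from attending *to* padding, but the pad positions themselves still read from the sequence.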