I’m noticing that for models like BERT, the values of the output logits change depending on how the samples are batched and padded.
An example:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
# arbitrarily pad with an extra token that should get ignored in self-attention via the `attention_mask`
a = "hey this is a thing"
b = "this is another thing but its wayyy longer"
inputs = tokenizer([a,b], padding = "max_length", max_length = len(tokenizer(b)['input_ids'])+1, return_tensors = 'pt')
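# sanity check: pad positions are flagged with 0 in the attention_mask
# (row 0 should be 7 ones + 6 zeros, row 1 should be 12 ones + 1 zero)
print(inputs['attention_mask'])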
logits = model(**inputs)['logits'] # torch.Size([2, 13, 30522])
# now individually get logits for each of the sequences
logits1 = model(**tokenizer(a, return_tensors = 'pt'))['logits'] # torch.Size([1, 7, 30522])
logits2 = model(**tokenizer(b, return_tensors = 'pt'))['logits'] # torch.Size([1, 12, 30522])
# check if they agree (mind the slicing :p)
print(torch.allclose(logits[0][:logits1.shape[1]].unsqueeze(0), logits1))
print(torch.allclose(logits[1][:logits2.shape[1]].unsqueeze(0), logits2))
>> False
>> False
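For what it’s worth, a quick way to see how large the mismatch actually is (rather than just a hard allclose) would be something like:
# max absolute difference between the batched/padded logits and the single-sequence logits
print((logits[0][:logits1.shape[1]] - logits1[0]).abs().max())
print((logits[1][:logits2.shape[1]] - logits2[0]).abs().max())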
Now if you instead just pad to the longest sequence in the batch and rerun everything, you get:
# ....
inputs = tokenizer([a,b], padding = True, return_tensors = 'pt')
# same as above
print(torch.allclose(logits[0][:logits1.shape[1]].unsqueeze(0), logits1))
print(torch.allclose(logits[1][:logits2.shape[1]].unsqueeze(0), logits2))
>> False
>> True
Does anyone know why changing how samples are padded or batched affects the actual model outputs, when intuitively the attention_mask is supposed to make self-attention ignore the padding tokens?
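For reference, my mental picture of what the attention_mask should be doing inside self-attention is roughly this (a simplified single-head sketch with made-up names, not the actual DistilBERT code):

import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v, attention_mask):
    # q, k, v: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token, 0 = pad
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, seq_len, seq_len)
    # pad key positions get a large negative bias so softmax gives them ~0 weight
    scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

So in principle the pad columns should carry ~0 attention weight and the real tokens' outputs shouldn't depend on how much padding was appended, which is why the discrepancy above surprises me.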