Why do the values of the logits change depending on whether samples are batched or not?

I’m noticing for models like BERT that the values of the model’s output logits change depending on how samples are batched and padded.

An example:

import torch 
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

a = "hey this is a thing"
b = "this is another thing but its wayyy longer"

# arbitrarily pad one token past the longer sequence, so even `b` ends up with a
# padding token that self-attention (SA) should ignore via the `attention_mask`
inputs = tokenizer([a, b], padding = "max_length", max_length = len(tokenizer(b)['input_ids']) + 1, return_tensors = 'pt')
logits = model(**inputs)['logits'] # torch.Size([2, 13, 30522])


# now get the logits for each sequence individually (no padding applied)
logits1 = model(**tokenizer(a, return_tensors = 'pt'))['logits'] # torch.Size([1, 7, 30522])
logits2 = model(**tokenizer(b, return_tensors = 'pt'))['logits'] # torch.Size([1, 12, 30522])

# check if they agree (mind the slicing :p)
print(torch.allclose(logits[0][:logits1.shape[1]].unsqueeze(0), logits1))
print(torch.allclose(logits[1][:logits2.shape[1]].unsqueeze(0), logits2))

>> False
>> False 
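
For what it’s worth, `torch.allclose` only returns a boolean; a quick way to see how large the disagreement actually is (just a sketch reusing `logits`, `logits1` and `logits2` from the snippet above, with `diff1`/`diff2` as throwaway names) is to look at the maximum absolute difference:

# how big is the mismatch, not just whether there is one
diff1 = (logits[0][:logits1.shape[1]].unsqueeze(0) - logits1).abs().max().item()
diff2 = (logits[1][:logits2.shape[1]].unsqueeze(0) - logits2).abs().max().item()
print(diff1, diff2)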

Now if you instead just pad to the longest sequence in the batch (`padding = True`) and rerun everything, you get:

# .... 
inputs = tokenizer([a,b], padding = True, return_tensors = 'pt')

# same as above 

print(torch.allclose(logits[0][:logits1.shape[1]].unsqueeze(0), logits1))
print(torch.allclose(logits[1][:logits2.shape[1]].unsqueeze(0), logits2))

>> False
>> True 
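
To separate the effect of batching from the effect of padding, here is a small sketch that reuses the `padding = True` `inputs` and `logits` from above and feeds each already-padded row through the model on its own, so the padding is identical and only the batch size changes (`logits_a`/`logits_b` are just names for the per-row outputs):

# run each padded row individually: same padding length, batch size 1
logits_a = model(input_ids = inputs['input_ids'][0:1], attention_mask = inputs['attention_mask'][0:1])['logits']
logits_b = model(input_ids = inputs['input_ids'][1:2], attention_mask = inputs['attention_mask'][1:2])['logits']
print(torch.allclose(logits[0:1], logits_a))
print(torch.allclose(logits[1:2], logits_b))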

Does anyone know why changing how samples are padded or batched affects the actual model outputs, when intuitively the `attention_mask` is supposed to make self-attention ignore the padding tokens?
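
Just to rule out a tokenizer mix-up, a quick sanity check along these lines (a sketch using the same `inputs` and `tokenizer` as above) shows whether the padded positions really are masked, i.e. hold the pad token id and a 0 in the `attention_mask`:

# padded positions should be 0 in the attention_mask and equal to the pad token id in input_ids
print(inputs['attention_mask'])
print(inputs['input_ids'] == tokenizer.pad_token_id)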