Why is perplexity calculation giving different results for the same input?

I’m following the Hugging Face docs on calculating the perplexity of fixed-length models (i.e. taking the exp of the average negative log-likelihood over the scored tokens). I’m trying to verify that the calculation works for various strings, and I’m getting odd behavior. In particular, the docs mention:

We don’t want the log-likelihood for the tokens we’re just treating as context to be included in our loss, so we can set these targets to -100 so that they are ignored
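My understanding of how this masking works (this is just my own sanity check with PyTorch's `cross_entropy`, not something from the docs) is that positions labelled -100 are dropped from the loss entirely, because `ignore_index` defaults to -100:

    import torch
    import torch.nn.functional as F

    # toy logits for 4 positions over a 5-token vocabulary
    logits = torch.randn(4, 5)
    targets = torch.tensor([2, 3, 1, 4])

    masked = targets.clone()
    masked[:2] = -100  # pretend the first two positions are just context

    # the -100 positions don't contribute to the averaged loss at all
    print(F.cross_entropy(logits, masked))            # ignore_index defaults to -100
    print(F.cross_entropy(logits[2:], targets[2:]))   # same value
    print(torch.exp(F.cross_entropy(logits[2:], targets[2:])))  # perplexity = exp(mean NLL)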

So given two different contexts followed by the same remaining tokens, the calculation should return the same perplexity, since only the non-masked tokens contribute to the loss. However, it does not:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

context_1 = 'here is some context_1 and some more stuff'
context_2 = 'here is some context and some more stuff and more stuff aspodkaspd'
answer_1 = 'this is not the answer'

# tokenize the full context+answer strings, and record how many tokens each context takes up on its own
input_ids_wrong = tokenizer(context_1 + answer_1, return_tensors="pt").input_ids
input_ids_correct = tokenizer(context_2 + answer_1, return_tensors="pt").input_ids
context_1_tokens_length = len(tokenizer(context_1, return_tensors="pt").input_ids[0])
context_2_tokens_length = len(tokenizer(context_2, return_tensors="pt").input_ids[0])

target_ids_wrong = input_ids_wrong.clone()
target_ids_correct = input_ids_correct.clone()

# set the context positions to -100 so they should be ignored by the loss
target_ids_wrong[:, :context_1_tokens_length] = -100
target_ids_correct[:, :context_2_tokens_length] = -100

print('target_ids_wrong', target_ids_wrong)
print('target_ids_correct', target_ids_correct)

# forward pass with labels; the returned loss is the mean cross-entropy over the non-masked label positions
with torch.no_grad():
    outputs_wrong = model(input_ids_wrong, labels=target_ids_wrong)
    outputs_correct = model(input_ids_correct, labels=target_ids_correct)
    
    neg_log_likelihood_wrong = outputs_wrong.loss
    neg_log_likelihood_correct = outputs_correct.loss

    ppl_wrong = torch.exp(neg_log_likelihood_wrong)
    ppl_correct = torch.exp(neg_log_likelihood_correct)
    print('ppl_wrong', ppl_wrong)
    print('ppl_correct', ppl_correct)

Output:

    target_ids_wrong tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,   19,
               59,    8, 1525,    1]])
    target_ids_correct tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
             -100, -100, -100, -100, -100, -100,   19,   59,    8, 1525,    1]])
    ppl_wrong tensor(9.0377)
    ppl_correct tensor(21.1208)