Computing Log-Probabilities in Two Different Ways

When I compute the log-probabilities of the next token in two different ways with a decoder-only transformer, via model.__call__ and via model.generate(..., output_scores=True), the results don't match. Here is a minimal working example (MWE):

import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", device_map="cpu", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
prompt = "I don't know that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_values = model.generate(
    input_ids,
    output_scores=True, return_dict_in_generate=True, do_sample=True,
    max_new_tokens=1, min_new_tokens=1,
)
gen_seq = gen_values.sequences       # prompt plus the one sampled token
gen_scores = gen_values.scores[0]    # scores for that generated step
scores = model(gen_seq, labels=gen_seq).logits[:, -1, :]
print((F.softmax(gen_scores, dim=-1) - F.softmax(scores, dim=-1)).abs().sum())  # prints a value between 1 and 2 consistently
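For scale: the summed absolute difference between two probability vectors is at most 2, so a value in the 1-2 range means the two distributions concentrate their mass on different tokens entirely, not that they differ by numerical noise. A standalone sketch of this metric (pure Python, no model required, with made-up logit values for illustration):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l1_gap(logits_a, logits_b):
    # Sum of absolute differences between the two softmax distributions,
    # i.e. the quantity the print statement above reports. Bounded by 2.
    pa, pb = softmax(logits_a), softmax(logits_b)
    return sum(abs(x - y) for x, y in zip(pa, pb))

# Identical logits: gap is zero.
print(l1_gap([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))

# Logits shifted by a constant: softmax is shift-invariant, so still zero.
print(l1_gap([1.0, 2.0, 3.0], [6.0, 7.0, 8.0]))

# Mass on completely different entries: gap approaches the maximum of 2.
print(l1_gap([10.0, 0.0, 0.0], [0.0, 0.0, 10.0]))
```

So a consistent 1-2 gap indicates the two code paths are scoring genuinely different distributions, not the same distribution up to floating-point error.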

Could you please let me know if I’m making a mistake in computing the log-probabilities, and if so, where?