Inconsistency in logit values between generation and direct model prediction #31127

Hi, I’m facing an inconsistency between the logits returned by the generate method and the logits obtained from a direct forward pass of the model on the same text.

import torch
import os
from transformers import AutoTokenizer, GPTNeoXForCausalLM

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.set_printoptions(precision=64)

# Set the random seed and deterministic algorithms for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(True)

# Initialize tokenizer and model, set the model to evaluation mode, and move it to the GPU
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m")

device = torch.device("cuda")

model.eval()
model.to(device)
model.double()

# Encode the input text
input_text = tokenizer.eos_token
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

# Generate sequences using the model
generate_outputs = model.generate(
    input_ids,
    max_length=len(input_ids[0]) + 5,
    do_sample=True,
    return_dict_in_generate=True,
    output_logits=True,
)

# Decode the generated sequence
generated_ids = generate_outputs.sequences
generated_text = tokenizer.decode(generated_ids[0])

# Re-encode and evaluate the generated text
inputs = tokenizer(generated_text, return_tensors="pt").input_ids.to(device)
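# Caveat: this assumes decode()/encode() round-trips to exactly the same ids;
# using generated_ids directly as the input would sidestep any tokenizer drift.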
outputs = model(inputs)

# Extract logits for the second generated token
# Note: generate_outputs.logits[i] holds the prediction for the (i+1)-th generated
# token, so index 1 and the slice below both select the prediction for the second
# new token. Adjust indices based on where the new tokens start in the sequence.
second_generated_token_logits_from_generate = generate_outputs.logits[1].squeeze()
second_generated_token_logits_from_model = outputs.logits[0, len(input_ids[0]):len(input_ids[0])+1].squeeze()

# Print out the logits for comparison
print("Logits from generate method for the second generated token:")
print(second_generated_token_logits_from_generate)
print("\nLogits from model for the second generated token:")
print(second_generated_token_logits_from_model)
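To make the mismatch easier to judge than rows of printed digits, I also compare the two tensors numerically (a small addition to the script above, reusing the two tensors it defines):

# Quantify the mismatch instead of eyeballing printed values
diff = (second_generated_token_logits_from_generate
        - second_generated_token_logits_from_model).abs()
print("\nmax abs diff:", diff.max().item())
print("allclose:", torch.allclose(second_generated_token_logits_from_generate,
                                  second_generated_token_logits_from_model))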

Without calling model.double() (comment out the line):

Logits from generate method for the second generated token:
tensor([1071.1469726562500000000000000000000000000000000000000000000000000000,
         230.7740783691406250000000000000000000000000000000000000000000000000,
        1070.1348876953125000000000000000000000000000000000000000000000000000,
         ...,
         230.7809600830078125000000000000000000000000000000000000000000000000,
         230.7765350341796875000000000000000000000000000000000000000000000000,
         230.7785034179687500000000000000000000000000000000000000000000000000],
       device='cuda:0')

Logits from model for the second generated token:
tensor([1071.1472167968750000000000000000000000000000000000000000000000000000,
         230.7741088867187500000000000000000000000000000000000000000000000000,
        1070.1352539062500000000000000000000000000000000000000000000000000000,
         ...,
         230.7809143066406250000000000000000000000000000000000000000000000000,
         230.7764892578125000000000000000000000000000000000000000000000000000,
         230.7785339355468750000000000000000000000000000000000000000000000000],
       device='cuda:0', grad_fn=<SqueezeBackward0>)

With calling model.double():

Logits from generate method for the second generated token:
tensor([1071.1467351228166080545634031295776367187500000000000000000000000000,
         230.7740661093434084705222630873322486877441406250000000000000000000,
        1070.1347990279657551582204177975654602050781250000000000000000000000,
         ...,
         230.7809335116805300458509009331464767456054687500000000000000000000,
         230.7765007713145166690082987770438194274902343750000000000000000000,
         230.7784833010678937625925755128264427185058593750000000000000000000],
       device='cuda:0', dtype=torch.float64)

Logits from model for the second generated token:
tensor([1071.1467351228168354282388463616371154785156250000000000000000000000,
         230.7740661093434368922316934913396835327148437500000000000000000000,
        1070.1347990279655277845449745655059814453125000000000000000000000000,
         ...,
         230.7809335116805016241414705291390419006347656250000000000000000000,
         230.7765007713144882472988683730363845825195312500000000000000000000,
         230.7784833010680358711397275328636169433593750000000000000000000000],
       device='cuda:0', dtype=torch.float64, grad_fn=<SqueezeBackward0>)

As you can see, even after calling model.double(), the logits differ slightly. The expected behavior is that the logits from the generate method and from a direct model forward pass are identical. I have not tried this with other models, but I believe this is not a model-specific issue.
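For what it’s worth, I suspect the two code paths are not numerically equivalent at the kernel level: generate decodes one token at a time with a KV cache, while the direct call runs a single forward pass over the whole sequence, so the underlying matmul reductions happen in a different order. The sketch below (using only the standard past_key_values / use_cache API, appended to the script above) tries to isolate that difference without involving generate at all; if the logits already diverge here, the discrepancy may be expected floating-point behavior rather than a generate bug:

# Full forward pass over the whole sequence in one call
full_logits = model(inputs).logits

# Incremental forward pass, one token at a time with a KV cache,
# mirroring what generate does internally
past = None
step_logits = []
for i in range(inputs.shape[1]):
    out = model(inputs[:, i:i + 1], past_key_values=past, use_cache=True)
    past = out.past_key_values
    step_logits.append(out.logits[:, -1, :])
incremental_logits = torch.stack(step_logits, dim=1)

print("max abs diff, full vs. incremental:",
      (full_logits - incremental_logits).abs().max().item())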