Hello everyone,
I’m currently digging into how a model behaves during generation with model.generate(). Specifically, I’ve been experimenting with generating text from a given prompt. Here’s a simplified breakdown of my experiment:
- I start with the prompt: "i love eating pepperoni pizza because"
- I aim to generate additional text from this prompt using two different methods:
  - Method 1: generating 16 tokens at once with model.generate(..., max_new_tokens=16, ...).
  - Method 2: generating tokens one by one, appending each new token to the input and calling model.generate(..., max_new_tokens=1, ...) again.
I initially expected the two methods to yield exactly the same output, since the model is autoregressive and the distribution over the next token should depend only on the prefix, not on how that prefix was produced. However, the results differ: both the logits and the newly generated tokens diverge significantly between the two methods.
For a more detailed look, here’s a snippet of the Python code I’m using (you can copy and run it directly):
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.trainer_utils import set_seed
import numpy as np
set_seed(42)
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
def generate_response(input_ids, attention_mask, max_new_tokens=16):
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=torch.tensor([input_ids]).to(DEVICE),
            attention_mask=torch.tensor([attention_mask]).to(DEVICE),
            output_scores=True,
            return_dict_in_generate=True,
            do_sample=True,
            top_p=1,
            num_return_sequences=1,
            max_new_tokens=max_new_tokens,
        )
    return generation_output
prompt = "i love eating pepperoni pizza because"
tokenized_input = tokenizer(prompt)
input_ids = tokenized_input["input_ids"]
attention_mask = tokenized_input["attention_mask"]
# METHOD 1: generating 16 tokens at once
output1 = generate_response(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=16,
)
token_ids1 = output1.sequences[0][len(input_ids):]
logits1 = np.array([output1.scores[i][0][token_id].detach().cpu().numpy() for i, token_id in enumerate(token_ids1)])
print(token_ids1)
print(logits1)
# tensor([ 329, 262, 749, 636, 340, 338, 1365, 621, 1642, 257,
# 20698, 13, 1649, 314, 373, 3957], device='cuda:0')
# [-104.20506 -100.19312 -102.920456 -128.92113 -81.56552 -98.53663
# -117.57404 -101.84916 -105.955185 -92.83515 -109.136475 -114.38927
# -149.21672 -91.5892 -157.68016 -135.1889 ]
# METHOD 2: generating token after token
token_ids2 = list()
logits2 = list()
for i in range(16):
    temp_output = generate_response(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=1,
    )
    new_token = temp_output.sequences[0][-1].detach().cpu().numpy().item()
    new_logit = temp_output.scores[0][0][new_token].detach().cpu().numpy()
    input_ids.append(new_token)
    attention_mask.append(1)
    token_ids2.append(new_token)
    logits2.append(new_logit.item())
print(token_ids2)
print(logits2)
# [673, 338, 1016, 284, 923, 6600, 13385, 14651, 477, 607, 1204, 11, 475, 428, 614, 30]
# [-101.92338562011719, -131.71400451660156, -123.15164947509766, -76.13870239257812, -134.41748046875,
# -119.96012115478516, -101.49273681640625, -64.4681396484375, -105.8199691772461, -82.54948425292969,
# -78.48075866699219, -85.4245376586914, -109.86181640625, -131.30165100097656, -92.7171630859375, -98.04169464111328]
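As a follow-up sanity check (a sketch of my own, reusing the model, tokenizer, DEVICE, and prompt defined above), I’m planning to rerun the comparison with greedy decoding: if the divergence comes purely from sampling randomness, then do_sample=False should, as far as I understand, make both methods produce identical tokens.
# Sanity-check sketch: repeat the experiment with greedy decoding (no sampling).
def generate_greedy(input_ids, attention_mask, max_new_tokens):
    with torch.no_grad():
        return model.generate(
            input_ids=torch.tensor([input_ids]).to(DEVICE),
            attention_mask=torch.tensor([attention_mask]).to(DEVICE),
            do_sample=False,  # greedy decoding, no randomness
            max_new_tokens=max_new_tokens,
            return_dict_in_generate=True,
        )
# Re-tokenize the prompt, since input_ids was mutated in the loop above.
greedy_input = tokenizer(prompt)
ids = list(greedy_input["input_ids"])
mask = list(greedy_input["attention_mask"])
# Greedy, 16 tokens at once.
greedy_all = generate_greedy(ids, mask, 16).sequences[0][len(ids):].tolist()
# Greedy, one token at a time.
greedy_step = []
for _ in range(16):
    tok = generate_greedy(ids, mask, 1).sequences[0][-1].item()
    ids.append(tok)
    mask.append(1)
    greedy_step.append(tok)
print(greedy_all == greedy_step)  # I would expect True if sampling is the only source of the difference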
I’d greatly appreciate any insights or explanations regarding this discrepancy. Am I misunderstanding something fundamental about the generation process, or could there be some nuances at play here that I haven’t considered?
Looking forward to your thoughts and suggestions!