Generating 16 Tokens at Once Is Not the Same as Generating a Single Token 16 Times?

Hello everyone,

I’m currently trying to understand how a model behaves during generation with model.generate(). Specifically, I’ve been experimenting with generating text from a given prompt. Here’s a simplified breakdown of my experiment:

  1. I start with a prompt: "i love eating pepperoni pizza because"
  2. I aim to generate additional text based on this prompt using two different methods:
  • Method 1: Generating 16 tokens at once using model.generate(..., max_new_tokens=16, ...).
  • Method 2: Generating tokens one by one, appending each newly generated token to the input and calling model.generate(..., max_new_tokens=1, ...) again.

I initially expected these two methods to yield exactly the same output, but they don’t: the logits and the newly generated tokens diverge significantly between the two methods.

For a more detailed look, here’s the Python code I’m using (you can copy and run it directly):

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

from transformers.trainer_utils import set_seed

import numpy as np

set_seed(42)

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


def generate_response(input_ids, attention_mask, max_new_tokens=16):
    # Wrap the token id / attention mask lists into single-example batches and
    # sample up to `max_new_tokens` new tokens, returning the scores as well.
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=torch.tensor([input_ids]).to(DEVICE),
            attention_mask=torch.tensor([attention_mask]).to(DEVICE),
            output_scores=True,
            return_dict_in_generate=True,
            do_sample=True,
            top_p=1,
            num_return_sequences=1,
            max_new_tokens=max_new_tokens,
        )

    return generation_output

prompt = "i love eating pepperoni pizza because"
tokenized_input = tokenizer(prompt)

input_ids = tokenized_input["input_ids"]
attention_mask = tokenized_input["attention_mask"]

# METHOD 1: generating 16 tokens at once
output1 = generate_response(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=16,
)

token_ids1 = output1.sequences[0][len(input_ids):]
logits1 = np.array([output1.scores[i][0][token_id].detach().cpu().numpy() for i, token_id in enumerate(token_ids1)])

print(token_ids1)
print(logits1)

# tensor([  329,   262,   749,   636,   340,   338,  1365,   621,  1642,   257,
#         20698,    13,  1649,   314,   373,  3957], device='cuda:0')
# [-104.20506  -100.19312  -102.920456 -128.92113   -81.56552   -98.53663
#  -117.57404  -101.84916  -105.955185  -92.83515  -109.136475 -114.38927
#  -149.21672   -91.5892   -157.68016  -135.1889  ]


# METHOD 2: generating token after token
token_ids2 = list()
logits2 = list()
for i in range(16):
    temp_output = generate_response(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=1,
    )

    new_token = temp_output.sequences[0][-1].detach().cpu().numpy().item()
    new_logit = temp_output.scores[0][0][new_token].detach().cpu().numpy()

    input_ids.append(new_token)
    attention_mask.append(1)

    token_ids2.append(new_token)
    logits2.append(new_logit.item())

print(token_ids2)
print(logits2)

# [673, 338, 1016, 284, 923, 6600, 13385, 14651, 477, 607, 1204, 11, 475, 428, 614, 30]
# [-101.92338562011719, -131.71400451660156, -123.15164947509766, -76.13870239257812, -134.41748046875, 
#  -119.96012115478516, -101.49273681640625, -64.4681396484375, -105.8199691772461, -82.54948425292969, 
#  -78.48075866699219, -85.4245376586914, -109.86181640625, -131.30165100097656, -92.7171630859375, -98.04169464111328]

I’d greatly appreciate any insights or explanations regarding this discrepancy. Am I misunderstanding something fundamental about the generation process, or could there be some nuances at play here that I haven’t considered?

Looking forward to your thoughts and suggestions!

Hi! You have to disable sampling to get exactly the same results; in other words, generation should always return the token id with the maximum probability.
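
For example, something along these lines (a minimal sketch, reusing the gpt2 model and prompt from the script above) should make the two methods agree, since greedy decoding deterministically picks the argmax token at every step:

# Minimal sketch: with do_sample=False both methods pick the argmax token
# at every step, so they should produce the same sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.config.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer("i love eating pepperoni pizza because", return_tensors="pt").to(device)

# Method 1: 16 tokens in one call, greedy decoding.
with torch.no_grad():
    out1 = model.generate(**inputs, do_sample=False, max_new_tokens=16)

# Method 2: one greedy token at a time, re-feeding the growing sequence.
ids, mask = inputs["input_ids"], inputs["attention_mask"]
with torch.no_grad():
    for _ in range(16):
        ids = model.generate(input_ids=ids, attention_mask=mask,
                             do_sample=False, max_new_tokens=1)
        mask = torch.ones_like(ids)

print(out1[0].tolist())
print(ids[0].tolist())  # should match out1 token for token

With do_sample=True, on the other hand, each step draws a token from the full distribution, so the two runs will almost always diverge.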


I once ran into a similar problem as @guanqun-yang. Thank you for providing the MWE!

@RaushanTurganbay Can you elaborate on why sampling has to be disabled? Is this related to the seed, i.e. np.random.randn(16) not being the same as calling np.random.randn() 16 times?

@MrRobot Sampling means that the next token is sampled from the logits distribution instead of taking the one with the maximum probability. For example, the script below will print different tokens every time we sample.

import torch

# Normalize random scores into a probability distribution over 100 "tokens",
# then sample from it repeatedly: each draw can return a different token.
dummy_logits_normalized = torch.randn(10, 100).softmax(dim=-1)
for _ in range(10):
    sampled_tokens = torch.multinomial(dummy_logits_normalized, num_samples=1)
    print(sampled_tokens)

Some discussion on torch.multinomial can be found here
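
As for the seed part of the question: one thing worth noting (my own reading of the script, so take it as a sketch rather than a definitive answer) is that set_seed(42) is called only once, and every multinomial draw advances the global RNG state, so Method 2 starts sampling from whatever state Method 1 left behind; and once the first sampled tokens differ, every later step is conditioned on a different prefix. A tiny illustration of the RNG-state part:

import torch

torch.manual_seed(42)
probs = torch.randn(100).softmax(dim=-1)  # a fixed dummy distribution

# A first "run" of 16 draws consumes randomness from the global generator...
first = [torch.multinomial(probs, num_samples=1).item() for _ in range(16)]

# ...so a second run, without re-seeding in between, starts from a different
# RNG state and will almost certainly draw a different sequence.
second = [torch.multinomial(probs, num_samples=1).item() for _ in range(16)]

print(first)
print(second)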

