Hello everyone,
I’m currently digging into how a model behaves during generation with model.generate(). Specifically, I’ve been experimenting with generating text from a given prompt. Here’s a simplified breakdown of my experiment:
- I start with the prompt: "i love eating pepperoni pizza because"
- I aim to generate additional text from this prompt using two different methods:
  - Method 1: generating 16 tokens at once with model.generate(..., max_new_tokens=16, ...).
  - Method 2: generating tokens one by one, appending each new token to the input and calling model.generate(..., max_new_tokens=1, ...) again.
I initially expected the two methods to yield exactly the same output, since the model is autoregressive and the distribution over the next token should depend only on the prefix, not on how that prefix was produced. However, the results differ: both the logits and the newly generated tokens diverge significantly between the two methods.
For a more detailed look, here’s a snippet of the Python code I’m using (you can copy and run it directly):
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.trainer_utils import set_seed
import numpy as np
set_seed(42)
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
def generate_response(input_ids, attention_mask, max_new_tokens=16):
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=torch.tensor([input_ids]).to(DEVICE),
            attention_mask=torch.tensor([attention_mask]).to(DEVICE),
            output_scores=True,
            return_dict_in_generate=True,
            do_sample=True,
            top_p=1,
            num_return_sequences=1,
            max_new_tokens=max_new_tokens,
        )
    return generation_output
prompt = "i love eating pepperoni pizza because"
tokenized_input = tokenizer(prompt)
input_ids = tokenized_input["input_ids"]
attention_mask = tokenized_input["attention_mask"]
# METHOD 1: generating 16 tokens at once
output1 = generate_response(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=16,
)
token_ids1 = output1.sequences[0][len(input_ids):]
logits1 = np.array([output1.scores[i][0][token_id].detach().cpu().numpy() for i, token_id in enumerate(token_ids1)])
print(token_ids1)
print(logits1)
# tensor([ 329, 262, 749, 636, 340, 338, 1365, 621, 1642, 257,
# 20698, 13, 1649, 314, 373, 3957], device='cuda:0')
# [-104.20506 -100.19312 -102.920456 -128.92113 -81.56552 -98.53663
# -117.57404 -101.84916 -105.955185 -92.83515 -109.136475 -114.38927
# -149.21672 -91.5892 -157.68016 -135.1889 ]
# METHOD 2: generating token after token
token_ids2 = list()
logits2 = list()
for i in range(16):
    temp_output = generate_response(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=1,
    )
    new_token = temp_output.sequences[0][-1].detach().cpu().numpy().item()
    new_logit = temp_output.scores[0][0][new_token].detach().cpu().numpy()
    input_ids.append(new_token)
    attention_mask.append(1)
    token_ids2.append(new_token)
    logits2.append(new_logit.item())
print(token_ids2)
print(logits2)
# [673, 338, 1016, 284, 923, 6600, 13385, 14651, 477, 607, 1204, 11, 475, 428, 614, 30]
# [-101.92338562011719, -131.71400451660156, -123.15164947509766, -76.13870239257812, -134.41748046875,
# -119.96012115478516, -101.49273681640625, -64.4681396484375, -105.8199691772461, -82.54948425292969,
# -78.48075866699219, -85.4245376586914, -109.86181640625, -131.30165100097656, -92.7171630859375, -98.04169464111328]
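As a follow-up sanity check (a sketch of my own, reusing the model, tokenizer, DEVICE, and prompt defined above), I’m planning to rerun the comparison with greedy decoding: if the divergence comes purely from sampling randomness, then do_sample=False should, as far as I understand, make both methods produce identical tokens.
# Sanity-check sketch: repeat the experiment with greedy decoding (no sampling).
def generate_greedy(input_ids, attention_mask, max_new_tokens):
    with torch.no_grad():
        return model.generate(
            input_ids=torch.tensor([input_ids]).to(DEVICE),
            attention_mask=torch.tensor([attention_mask]).to(DEVICE),
            do_sample=False,  # greedy decoding, no randomness
            max_new_tokens=max_new_tokens,
            return_dict_in_generate=True,
        )
# Re-tokenize the prompt, since input_ids was mutated in the loop above.
greedy_input = tokenizer(prompt)
ids = list(greedy_input["input_ids"])
mask = list(greedy_input["attention_mask"])
# Greedy, 16 tokens at once.
greedy_all = generate_greedy(ids, mask, 16).sequences[0][len(ids):].tolist()
# Greedy, one token at a time.
greedy_step = []
for _ in range(16):
    tok = generate_greedy(ids, mask, 1).sequences[0][-1].item()
    ids.append(tok)
    mask.append(1)
    greedy_step.append(tok)
print(greedy_all == greedy_step)  # I would expect True if sampling is the only source of the difference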
I’d greatly appreciate any insights or explanations regarding this discrepancy. Am I misunderstanding something fundamental about the generation process, or could there be some nuances at play here that I haven’t considered?
Looking forward to your thoughts and suggestions!