Results of model.generate are different for different batch sizes of a decoder-only model

Hello, I am using the LLaMA model, which is a decoder-only autoregressive generation model.
I am trying to make my model accept a batch of inputs at a time and generate decoded results. However, I found that feeding in samples with a batch size greater than 1 makes the generated results unstable.
Specifically, I tried

inputs = tokenizer("prompt", return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
input_ids, inputs

and get

(tensor([[   1, 9508]], device='cuda:0'),
 {'input_ids': tensor([[   1, 9508]]), 'attention_mask': tensor([[1, 1]])})

And

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
inputs_b = tokenizer(["prompt", "prompt", "prompt"], return_tensors="pt", padding=True).to(device)
input_idsb = inputs_b["input_ids"].to(device)
input_idsb, inputs_b

and get

(tensor([[   1, 9508],
         [   1, 9508],
         [   1, 9508]], device='cuda:0'),
 {'input_ids': tensor([[   1, 9508],
         [   1, 9508],
         [   1, 9508]], device='cuda:0'), 'attention_mask': tensor([[1, 1],
         [1, 1],
         [1, 1]], device='cuda:0')})

You can see that the tensor for each item in input_ids is the same, which matches my understanding, because the same words are mapped to the same token ids.
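
A quick sanity check (just a sketch, assuming the tensors defined above are still in scope) confirms that every row of the batched input is identical to the single-prompt input:

import torch

# Every row of the batched input_ids should exactly match the single-prompt input_ids,
# and the attention mask should be all ones (no padding, since the prompts are identical).
assert all(torch.equal(row, input_ids[0]) for row in input_idsb)
assert torch.equal(inputs_b["attention_mask"], torch.ones_like(input_idsb))
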
But when I generated with almost the same method and the same parameters, something strange happened: the two calls returned different tensors.

generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output

and

generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_idsb,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output
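
The texts below come from decoding the returned sequences, roughly like this (a sketch; the exact decoding call does not matter for the issue):

# Decode the full generated sequences (prompt + new tokens), dropping special tokens.
texts = tokenizer.batch_decode(generation_output.sequences, skip_special_tokens=True)
texts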

For the sample with a batch size of 1, decoding yields

" promptly and efficiently.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the Customer has given written notice to the Company of the delay within 7 days of the date when the Goods were due to be delivered.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the"

While for the samples with a batch size of 3, decoding yields

[' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
 ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
 ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in']

I want to know why the results differ, that is, why each sample in the batch size 3 run is different from the batch size 1 run even though their input tensors are identical. And what can I do to fix it so that they produce the same results as batch size 1 (because those are more stable)?


I am facing the same issue. @muellerzr can you help?


Same observation here, using Llama-2 with a LoRA adapter.
This is my code:

tokenizer.padding_side = 'left'
inputs = tokenizer(
    test_instructions,  # len == 8
    return_tensors="pt",
    padding=True,
    truncation=True,
)

inputs = {k: v.to(device) for k, v in inputs.items() if k in ['input_ids', 'attention_mask']}
outputs = model.generate(
    **inputs,
    generation_config=generation_config,
    max_new_tokens=50,
    temperature=0,
    min_length=2,
)
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

Even though I set min_length to 2, my first instance does not have any generations before the token.

I’ve encountered the same issue where feeding [[sentence A]] or [[sentence A],[sentence A]] resulted in different outputs. Upon examining the logits of the output token where the difference begins, I discovered that this location had two candidate tokens with similar logit scores. Additionally, when I switched my model back to float32 instead of bfloat16, the inconsistency of outputs disappeared. I suspect that the problem may be due to subtle rounding issues. However, I’m also curious as to why this can result in discrepancies when varying the batch size. @THEATLAS @TopRightExit
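
To illustrate what I mean (just a sketch; model_name, device and the prompt are placeholders, not taken from the posts above), you can compare the first-step logits for the same prompt at batch size 1 vs batch size 2 under bfloat16 and float32:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: any decoder-only checkpoint
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def first_step_logits(model, prompts):
    # All prompts are identical, so in exact arithmetic every row would be identical too.
    enc = tokenizer(prompts, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**enc)
    return out.logits[:, -1, :]  # logits for the first token to be generated

for dtype in (torch.bfloat16, torch.float32):
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)
    single = first_step_logits(model, ["prompt"])
    batched = first_step_logits(model, ["prompt", "prompt"])
    # Maximum absolute difference between the batch-size-1 logits and each batched row.
    print(dtype, (batched - single).abs().max().item())

Even a tiny non-zero difference here is enough to flip the argmax at positions where two candidate tokens have nearly equal scores, which is exactly where my outputs started to diverge.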

cc @joaogante who had a thread on this

Sounds reasonable


I am also seeing this behavior with a fine-tuned mistralai/Mistral-7B-v0.1 and HF Generate. The outputs are slightly different when I increase the batch size from 1 to >1. I am using beam search decoding with do_sample=False. There’s a partial explanation here: model generate with different batch size but get different results · Issue #23017 · huggingface/transformers · GitHub
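
For reference, this is roughly how I compare the two settings (a sketch; it assumes model and a left-padding tokenizer with a pad token are already loaded, and the prompt is a placeholder):

import torch
from transformers import GenerationConfig

gen_config = GenerationConfig(num_beams=4, do_sample=False, max_new_tokens=64)

def generate_texts(prompts):
    enc = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**enc, generation_config=gen_config)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

single = generate_texts(["prompt"])[0]
batched = generate_texts(["prompt"] * 3)
print(all(text == single for text in batched))  # often False in fp16/bf16, as described above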