Results of model.generate are different for different batch sizes of a decoder-only model

Hello, I am using the LLaMA model, which is a decoder-only autoregressive generation model.
I am trying to make my model accept a batch of inputs at a time and generate decoded results. However, I found that feeding in samples with a batch size greater than 1 makes the generated results unstable.
Specifically, I tried

inputs = tokenizer("prompt", return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
input_ids, inputs

and get

(tensor([[   1, 9508]], device='cuda:0'),
 {'input_ids': tensor([[   1, 9508]]), 'attention_mask': tensor([[1, 1]])})

And

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
inputs_b = tokenizer(["prompt", "prompt", "prompt"], return_tensors="pt", padding=True).to(device)
input_idsb = inputs_b["input_ids"].to(device)
input_idsb, inputs_b

and get

(tensor([[   1, 9508],
         [   1, 9508],
         [   1, 9508]], device='cuda:0'),
 {'input_ids': tensor([[   1, 9508],
         [   1, 9508],
         [   1, 9508]], device='cuda:0'), 'attention_mask': tensor([[1, 1],
         [1, 1],
         [1, 1]], device='cuda:0')})

You can see that the tensor for each item in input_ids is the same, which matches my understanding, because the same words are mapped to the same token ids.
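
A quick sanity check (just a sketch, assuming the tensors defined above are still in scope) confirms that every row of the batched input is identical to the single-prompt input:

import torch

# Every row of the batched input_ids should exactly match the single-prompt input_ids,
# and the attention mask should be all ones (no padding, since the prompts are identical).
assert all(torch.equal(row, input_ids[0]) for row in input_idsb)
assert torch.equal(inputs_b["attention_mask"], torch.ones_like(input_idsb))
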
But when I generated with almost the same method and the same parameters, something strange happened: the two calls returned different tensors.

generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output

and

generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_idsb,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output
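
The texts below come from decoding the returned sequences, roughly like this (a sketch; the exact decoding call does not matter for the issue):

# Decode the full generated sequences (prompt + new tokens), dropping special tokens.
texts = tokenizer.batch_decode(generation_output.sequences, skip_special_tokens=True)
texts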

For the sample with a batch size of 1, decoding yields

" promptly and efficiently.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the Customer has given written notice to the Company of the delay within 7 days of the date when the Goods were due to be delivered.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the"

While for the samples with a batch size of 3, decoding yields

[' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
 ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
 ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in']

I want to know why the results differ, that is, why each sample in the batch size 3 run is different from the batch size 1 run even though their input tensors are identical. And what can I do to fix it so that they produce the same results as batch size 1 (because those are more stable)?


I am facing the same issue. @muellerzr can you help?


Same observation here, using Llama-2 with a LoRA adapter.
This is my code:

tokenizer.padding_side = 'left'
inputs = tokenizer(
    test_instructions,  # len == 8
    return_tensors="pt",
    padding=True,
    truncation=True,
)

inputs = {k: v.to(device) for k, v in inputs.items() if k in ['input_ids', 'attention_mask']}
outputs = model.generate(
    **inputs,
    generation_config=generation_config,
    max_new_tokens=50,
    temperature=0,
    min_length=2,
)
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

Even though I set min_length to 2, my first instance does not have any generations before the token.

I’ve encountered the same issue where feeding [[sentence A]] or [[sentence A],[sentence A]] resulted in different outputs. Upon examining the logits of the output token where the difference begins, I discovered that this location had two candidate tokens with similar logit scores. Additionally, when I switched my model back to float32 instead of bfloat16, the inconsistency of outputs disappeared. I suspect that the problem may be due to subtle rounding issues. However, I’m also curious as to why this can result in discrepancies when varying the batch size. @THEATLAS @TopRightExit
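
To illustrate what I mean (just a sketch; model_name, device and the prompt are placeholders, not taken from the posts above), you can compare the first-step logits for the same prompt at batch size 1 vs batch size 2 under bfloat16 and float32:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: any decoder-only checkpoint
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def first_step_logits(model, prompts):
    # All prompts are identical, so in exact arithmetic every row would be identical too.
    enc = tokenizer(prompts, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**enc)
    return out.logits[:, -1, :]  # logits for the first token to be generated

for dtype in (torch.bfloat16, torch.float32):
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)
    single = first_step_logits(model, ["prompt"])
    batched = first_step_logits(model, ["prompt", "prompt"])
    # Maximum absolute difference between the batch-size-1 logits and each batched row.
    print(dtype, (batched - single).abs().max().item())

Even a tiny non-zero difference here is enough to flip the argmax at positions where two candidate tokens have nearly equal scores, which is exactly where my outputs started to diverge.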

cc @joaogante who had a thread on this

Sounds reasonable


I am also seeing this behavior with a fine-tuned mistralai/Mistral-7B-v0.1 and HF Generate. The outputs are slightly different when I increase the batch size from 1 to >1. I am using beam search decoding with do_sample=False. There’s a partial explanation here: model generate with different batch size but get different results · Issue #23017 · huggingface/transformers · GitHub
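
For reference, this is roughly how I compare the two settings (a sketch; it assumes model and a left-padding tokenizer with a pad token are already loaded, and the prompt is a placeholder):

import torch
from transformers import GenerationConfig

gen_config = GenerationConfig(num_beams=4, do_sample=False, max_new_tokens=64)

def generate_texts(prompts):
    enc = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**enc, generation_config=gen_config)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

single = generate_texts(["prompt"])[0]
batched = generate_texts(["prompt"] * 3)
print(all(text == single for text in batched))  # often False in fp16/bf16, as described above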