To address the inconsistency in logits between single and batch inputs when using inputs_embeds, ensure that the inputs_embeds match the model’s data type. Convert inputs_embeds to the model’s torch_dtype before inference. Modify the code as follows:
```python
# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# Ensure inputs_embeds are in the model's dtype
if model.config.torch_dtype is not None:
    inputs_embeds = inputs_embeds.to(model.config.torch_dtype)
```
This converts the embeddings to the model’s expected dtype, ensuring consistency between single and batch inference.
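If you want to double-check that the conversion actually took effect, a quick sanity check along these lines should work (a minimal sketch, not part of the original snippet; `model` and `inputs_embeds` are the objects defined above):

```python
# Sanity check: after the conversion, the embedding output should match the
# model's compute dtype; a mismatch here is what changes the batched numerics.
print("model dtype:        ", model.dtype)
print("inputs_embeds dtype:", inputs_embeds.dtype)
assert inputs_embeds.dtype == model.dtype, "dtype mismatch between embeddings and model"
```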
Answer:
Ensure that inputs_embeds is converted to the model's torch_dtype before inference by adding the dtype conversion step:

```python
# Add this line after getting inputs_embeds
inputs_embeds = inputs_embeds.to(model.config.torch_dtype)
```

This adjustment keeps the data types consistent between batch and single inputs, resolving the inconsistency issue [2].
Thanks for your help!
I tried the methods mentioned in the above posts, including setting “use_cache=False”, manually setting the attention mask, and making sure the dtypes are the same, but all of them failed.
I further found that only “cuda” causes the inconsistency and “cpu” works fine, but I am still struggling to make “cuda” batch inference produce consistent results.
The numerical gap is fairly large:

```
tensor([[ 7.5312,  9.3750,  6.0625,  ..., -3.6250, -3.6250, -3.6250]],
       device='cuda:0')  # batch
tensor([[ 7.2812,  9.2500,  6.2188,  ..., -3.7969, -3.7969, -3.7969]],
       device='cuda:0')  # single
```
It seems to happen with any dtype other than torch.float32, and it seems particularly noticeable with torch.bfloat16. Some people also point out that it is a problem specific to Qwen 2.5.
With bfloat16, the attention implementation may also be a suspect.
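One way to rule the attention kernel in or out (a sketch, not a guaranteed fix) is to load the model with the eager attention implementation so that SDPA/FlashAttention is taken out of the picture; if the batch/single gap shrinks under bfloat16, the fused kernel's batched reduction order is the likely source of the difference:

```python
from transformers import AutoModelForCausalLM
import torch

# Force the pure-PyTorch ("eager") attention path instead of a fused kernel.
# attn_implementation is supported in recent transformers releases;
# alternatives are "sdpa" and "flash_attention_2".
model_eager = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
).eval()
```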
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load model and tokenizer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
).eval().to(torch.float32)  # if bfloat16, it causes inconsistency
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("model.dtype: ", model.dtype)
print("model.device: ", model.device)

# input texts
texts = ['a', 'b', 'c']

# tokenize
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)

# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# get attention_mask and position_ids
attention_mask = inputs.attention_mask
position_ids = torch.arange(inputs.input_ids.shape[1], device=model.device).unsqueeze(0).expand(inputs.input_ids.shape[0], -1)

# batch
with torch.no_grad():
    output_batch = model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
    ).logits[0]

# single
with torch.no_grad():
    output_single = model(
        inputs_embeds=inputs_embeds[0].unsqueeze(0),
        attention_mask=attention_mask[0].unsqueeze(0),
        position_ids=position_ids[0].unsqueeze(0),
    ).logits[0]

# check consistency
is_close = torch.allclose(output_batch, output_single, atol=1e-5, rtol=1e-3)
print("consistent?: ", is_close)
print("batch: ", output_batch)
print("single: ", output_single)
```
Thanks for your suggestions, and I finally decided to use float32.
In addition, in the case of quantized LLMs, converting the model to float32 still gives inconsistent outputs, perhaps because a quantized LLM has its own mechanisms for some mathematical operations that can't be transferred to a float32 model by simply changing its parameters' dtype. (model.dequantize() raises a NotImplementedError when using Qwen2.5-1.5B-Instruct.) I compromised by decomposing the batch into single inputs and accepting the lower efficiency.
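For anyone hitting the same wall, this is roughly what I mean by the workaround (a sketch; it assumes the same model, inputs_embeds, attention_mask, and position_ids as in the script above and simply runs the batch one example at a time):

```python
# run each example separately so every forward pass sees batch size 1;
# slower than true batching, but each pass is identical to the single-input case
single_logits = []
with torch.no_grad():
    for i in range(inputs_embeds.shape[0]):
        out = model(
            inputs_embeds=inputs_embeds[i].unsqueeze(0),
            attention_mask=attention_mask[i].unsqueeze(0),
            position_ids=position_ids[i].unsqueeze(0),
        ).logits
        single_logits.append(out[0])
logits = torch.stack(single_logits)  # shape: (batch, seq_len, vocab)
```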