Asking for help: output inconsistency between LLM batch inference and single-input inference

The inconsistency seems to appear with any dtype other than torch.float32, and it is especially noticeable with torch.bfloat16. Some reports suggest the problem is specific to Qwen 2.5. With bfloat16, the attention implementation (SDPA) also looks like a possible culprit.
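One way to test the attention suspicion (a minimal sketch, assuming a transformers version that accepts the attn_implementation argument) is to reload the checkpoint in bfloat16 but force the eager attention path instead of SDPA, and rerun the same comparison:

import torch
from transformers import AutoModelForCausalLM

# reload the model with plain (eager) attention instead of SDPA, keeping bfloat16,
# to check whether the batch/single mismatch disappears without the fused kernel
model_eager = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # default is usually "sdpa" on recent torch
    trust_remote_code=True,
).eval()

Here is the full script that reproduces the issue: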

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load model and tokenizer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
).eval().to(torch.float32)  # keeping bfloat16 here causes the inconsistency
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("model.dtype: ", model.dtype)
print("model.device: ", model.device)

# input texts
texts = ['a', 'b', 'c']

# tokenize
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)

# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# get attention_mask and position_ids
attention_mask = inputs.attention_mask
position_ids = torch.arange(
    inputs.input_ids.shape[1], device=model.device
).unsqueeze(0).expand(inputs.input_ids.shape[0], -1)

# batch
with torch.no_grad():
    output_batch = model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids
    ).logits[0]

# single
with torch.no_grad():
    output_single = model(
        inputs_embeds=inputs_embeds[0].unsqueeze(0), 
        attention_mask=attention_mask[0].unsqueeze(0),
        position_ids=position_ids[0].unsqueeze(0)
    ).logits[0] 

# check consistency
is_close = torch.allclose(output_batch, output_single, atol=1e-5, rtol=1e-3)
print("consistent?: ", is_close)
print("batch: ", output_batch)
print("single: ", output_single)

Output with torch.bfloat16:

model.dtype:  torch.bfloat16
model.device:  cuda:0
  attn_output = torch.nn.functional.scaled_dot_product_attention(
consistent?:  False
batch:  tensor([[ 7.4688,  9.3125,  6.0625,  ..., -3.5469, -3.5469, -3.5469]],
       device='cuda:0', dtype=torch.bfloat16)
single:  tensor([[ 6.9375,  8.9375,  5.9375,  ..., -3.7188, -3.7188, -3.7188]],
       device='cuda:0', dtype=torch.bfloat16)

Output with torch.float32:

model.dtype:  torch.float32
model.device:  cuda:0
consistent?:  False
batch:  tensor([[ 7.6105,  9.9338,  6.7679,  ..., -3.6860, -3.6860, -3.6861]],
       device='cuda:0')
single:  tensor([[ 7.6105,  9.9338,  6.7679,  ..., -3.6860, -3.6860, -3.6861]],
       device='cuda:0')
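For reference, exact agreement between batched and single-sample forward passes is not guaranteed even outside transformers: different input shapes can select different GPU kernels and reduction orders, and bfloat16 has few enough mantissa bits that the resulting rounding differences become visible. A standalone sketch (made-up shapes, no Qwen involved) that can show the same effect:

import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# three "sequences" of hidden states and one projection weight, in bfloat16
x = torch.randn(3, 8, 2048, device=device, dtype=torch.bfloat16)
w = torch.randn(2048, 2048, device=device, dtype=torch.bfloat16)

batched = x @ w       # one batched matmul over all three sequences
single = x[0:1] @ w   # the first sequence computed on its own

# depending on hardware and kernel selection this can print False in bfloat16,
# with a small but nonzero maximum difference
print(torch.allclose(batched[0], single[0]))
print((batched[0] - single[0]).abs().max().item())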

Possible causes