To address the inconsistency in logits between single and batch inputs when using inputs_embeds, ensure that the inputs_embeds match the model’s data type. Convert inputs_embeds to the model’s torch_dtype before inference. Modify the code as follows:
```python
# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# Ensure inputs_embeds are in the model's dtype
if model.config.torch_dtype is not None:
    inputs_embeds = inputs_embeds.to(model.config.torch_dtype)
```
This converts the embeddings to the model’s expected dtype, ensuring consistency between single and batch inference.
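If you want to double-check that the conversion actually took effect, a quick sanity check along these lines should work (a minimal sketch, not part of the original snippet; `model` and `inputs_embeds` are the objects defined above):

```python
# Sanity check: after the conversion, the embedding output should match the
# model's compute dtype; a mismatch here is what changes the batched numerics.
print("model dtype:        ", model.dtype)
print("inputs_embeds dtype:", inputs_embeds.dtype)
assert inputs_embeds.dtype == model.dtype, "dtype mismatch between embeddings and model"
```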
Answer:
Ensure that inputs_embeds is converted to the model's torch_dtype before inference by adding the dtype conversion step:

```python
# Add this line after getting inputs_embeds
inputs_embeds = inputs_embeds.to(model.config.torch_dtype)
```

This adjustment keeps the data types consistent between batch and single inputs, resolving the inconsistency issue [2].
Thanks for your help!
I tried the methods mentioned in the above posts, including setting “use_cache=False”, manually setting the attention mask, and making sure the dtypes are the same, but all of them failed.
I further found that only “cuda” causes the inconsistency and “cpu” works fine, but I am still struggling to make “cuda” batch inference produce consistent results.
The numerical gap is fairly large:

```
tensor([[ 7.5312,  9.3750,  6.0625,  ..., -3.6250, -3.6250, -3.6250]],
       device='cuda:0')  # batch
tensor([[ 7.2812,  9.2500,  6.2188,  ..., -3.7969, -3.7969, -3.7969]],
       device='cuda:0')  # single
```
It seems to happen with any dtype other than torch.float32, and it seems particularly noticeable with torch.bfloat16. Some people also point out that it is a problem specific to Qwen 2.5.
With bfloat16, the attention implementation may also be a suspect.
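One way to rule the attention kernel in or out (a sketch, not a guaranteed fix) is to load the model with the eager attention implementation so that SDPA/FlashAttention is taken out of the picture; if the batch/single gap shrinks under bfloat16, the fused kernel's batched reduction order is the likely source of the difference:

```python
from transformers import AutoModelForCausalLM
import torch

# Force the pure-PyTorch ("eager") attention path instead of a fused kernel.
# attn_implementation is supported in recent transformers releases;
# alternatives are "sdpa" and "flash_attention_2".
model_eager = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
).eval()
```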
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load model and tokenizer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
).eval().to(torch.float32)  # if bfloat16, it causes inconsistency
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("model.dtype: ", model.dtype)
print("model.device: ", model.device)

# input texts
texts = ['a', 'b', 'c']

# tokenize
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)

# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# get attention_mask and position_ids
attention_mask = inputs.attention_mask
position_ids = torch.arange(inputs.input_ids.shape[1], device=model.device).unsqueeze(0).expand(inputs.input_ids.shape[0], -1)

# batch
with torch.no_grad():
    output_batch = model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
    ).logits[0]

# single
with torch.no_grad():
    output_single = model(
        inputs_embeds=inputs_embeds[0].unsqueeze(0),
        attention_mask=attention_mask[0].unsqueeze(0),
        position_ids=position_ids[0].unsqueeze(0),
    ).logits[0]

# check consistency
is_close = torch.allclose(output_batch, output_single, atol=1e-5, rtol=1e-3)
print("consistent?: ", is_close)
print("batch: ", output_batch)
print("single: ", output_single)
```
Thanks for your suggestions, and I finally decided to use float32.
In addition, in the case of quantized LLMs, converting the model to float32 still gives inconsistent outputs, perhaps because a quantized LLM has its own mechanisms for some mathematical operations that can't be transferred to a float32 model by simply changing its parameters' dtype. (model.dequantize() raises a NotImplementedError when using Qwen2.5-1.5B-Instruct.) I compromised by decomposing the batch into single inputs and accepting the lower efficiency.
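For anyone hitting the same wall, this is roughly what I mean by the workaround (a sketch; it assumes the same model, inputs_embeds, attention_mask, and position_ids as in the script above and simply runs the batch one example at a time):

```python
# run each example separately so every forward pass sees batch size 1;
# slower than true batching, but each pass is identical to the single-input case
single_logits = []
with torch.no_grad():
    for i in range(inputs_embeds.shape[0]):
        out = model(
            inputs_embeds=inputs_embeds[i].unsqueeze(0),
            attention_mask=attention_mask[i].unsqueeze(0),
            position_ids=position_ids[i].unsqueeze(0),
        ).logits
        single_logits.append(out[0])
logits = torch.stack(single_logits)  # shape: (batch, seq_len, vocab)
```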