Asking for help: output inconsistency when using LLM batch inference compared to a single input

Perhaps a KV-cache issue?
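For reference, this is roughly the kind of check involved (a sketch only; the model name and prompts are placeholders, since the original setup is not shown): run one prompt on its own, run the same prompt inside a left-padded batch, and compare the last-token logits.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the actual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # keep the real last token at the final position
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "The capital of France is"
other = "Hi"

# Single input
single = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    single_logits = model(**single).logits[0, -1]

# Same prompt inside a padded batch
batch = tokenizer([prompt, other], return_tensors="pt", padding=True)
with torch.no_grad():
    batch_logits = model(**batch).logits[0, -1]

# In principle these should agree up to floating-point noise; a large gap
# is the inconsistency described above.
print((single_logits - batch_logits).abs().max())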

Tips from HuggingChat:


To address the inconsistency in logits between single and batch inputs when using inputs_embeds, make sure inputs_embeds match the model’s data type: convert them to the model’s torch_dtype before inference. Modify the code as follows:

import torch  # assumes model and inputs are already defined elsewhere

# Compute the input embeddings from the token ids
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)
    # Cast the embeddings to the model's expected dtype (e.g. torch.float16)
    if model.config.torch_dtype is not None:
        inputs_embeds = inputs_embeds.to(model.config.torch_dtype)

This converts the embeddings to the model’s expected dtype, ensuring consistency between single and batch inference.
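A minimal sketch of how the cast embeddings would then be fed back into the model, assuming inputs comes from a tokenizer call that also returned an attention_mask (variable names are assumptions, not taken from the original code):

# Forward pass using the dtype-cast embeddings instead of input_ids;
# the attention_mask from the tokenizer is still required for batched inputs.
with torch.no_grad():
    outputs = model(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
    )
logits = outputs.logits  # shape: (batch_size, seq_len, vocab_size)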

Answer:
Ensure that inputs_embeds are converted to the model’s torch_dtype before inference by adding the dtype conversion step:

# Add this line after getting inputs_embeds
inputs_embeds = inputs_embeds.to(model.config.torch_dtype)

This adjustment ensures that the data types are consistent between batch and single inputs, resolving the inconsistency issue [2].
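One way to sanity-check the fix (a sketch; single_logits and batch_logits stand for the last-token logits from the single-input and batched runs, as in the comparison near the top):

# With matching dtypes, the difference between the single-input and batched
# logits should be down at floating-point noise.
diff = (single_logits - batch_logits).abs().max()
print(f"max abs difference: {diff.item():.6f}")
print(torch.allclose(single_logits, batch_logits, atol=1e-4))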