Thanks for your suggestions; I finally decided to use float32.
In addition, in the case of quantized LLMs, casting the model to float32 still gives inconsistent outputs, perhaps because a quantized LLM has its own special handling of some mathematical operations that can't be transferred to a float32 model by simply changing its parameters' dtype. (model.dequantize() raises a NotImplementedError when using Qwen2.5-1.5B-Instruct.) I compromised by decomposing the batch into single inputs and accepting the lower efficiency.
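For reference, this is roughly what my workaround looks like, a minimal sketch assuming a standard transformers generation loop; the prompts, generation settings, and the lack of an explicit quantization config are just placeholders, not my exact setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # quantization config omitted in this sketch

prompts = ["first prompt", "second prompt"]  # the original batch

outputs = []
for prompt in prompts:
    # Encode each prompt separately so no padding is introduced,
    # which keeps results consistent with true single-input runs.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated part.
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    outputs.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Running each input on its own avoids the padding and batched-kernel paths entirely, which is why the outputs match the single-input case, at the obvious cost of throughput.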