Thanks for your suggestions; I finally decided to use float32.
In addition, in the case of quantized LLMs, casting the model to float32 still gives inconsistent outputs, perhaps because a quantized LLM has its own special handling of some mathematical operations that can't be transferred to a float32 model by simply changing its parameters' dtype. (model.dequantize() raises a NotImplementedError when using Qwen2.5-1.5B-Instruct.) I compromised by decomposing the batch into single inputs and accepting the lower efficiency.
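For reference, this is roughly what my workaround looks like, a minimal sketch assuming a standard transformers generation loop; the prompts, generation settings, and the lack of an explicit quantization config are just placeholders, not my exact setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # quantization config omitted in this sketch

prompts = ["first prompt", "second prompt"]  # the original batch

outputs = []
for prompt in prompts:
    # Encode each prompt separately so no padding is introduced,
    # which keeps results consistent with true single-input runs.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated part.
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    outputs.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Running each input on its own avoids the padding and batched-kernel paths entirely, which is why the outputs match the single-input case, at the obvious cost of throughput.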