Asking for help: Output inconsistency when using LLM batch inference compared to single input

Thanks for your help!
I tried the methods mentioned in the above posts, including setting `use_cache=False`, manually setting the attention mask, and making sure the dtype is the same, but none of them worked.
I further found that only "cuda" causes the inconsistency; "cpu" works fine. But I am still struggling to make "cuda" batch inference produce consistent results.
The numerical gap is kind of big:
```
tensor([[ 7.5312,  9.3750,  6.0625,  ..., -3.6250, -3.6250, -3.6250]],
       device='cuda:0')  # batch
tensor([[ 7.2812,  9.2500,  6.2188,  ..., -3.7969, -3.7969, -3.7969]],
       device='cuda:0')  # single
```
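
For context, here is a minimal sketch of the kind of comparison that produces the gap above. The model name and prompts are placeholders, not my actual setup; the point is just the batched vs. single forward pass with an explicit attention mask and `use_cache=False`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM shows the same setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float32  # same dtype for both passes
).to("cuda").eval()

prompts = ["The quick brown fox", "Hello"]  # placeholder inputs

with torch.no_grad():
    # Batched forward pass with padding and an explicit attention mask.
    batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    batch_logits = model(**batch, use_cache=False).logits

    # Single-input forward pass for the first prompt only.
    single = tokenizer(prompts[0], return_tensors="pt").to("cuda")
    single_logits = model(**single, use_cache=False).logits

# With the default right padding, the first prompt's tokens occupy the same
# positions in both passes, so the logits at its last real token should match.
seq_len = single["input_ids"].shape[1]
print(batch_logits[0, seq_len - 1])  # batch
print(single_logits[0, -1])          # single
```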
