Asking for help: Output inconsistency when using LLM batch inference compared to single input

Thanks for your help!
I tried the methods mentioned in the above posts, including setting `use_cache=False`, manually setting the attention mask, and making sure the dtype is the same, but none of them worked.
I further found that only "cuda" causes the inconsistency; "cpu" works fine. But I am still struggling to make "cuda" batch inference produce consistent results.
The numerical gap is kind of big:
```
tensor([[ 7.5312,  9.3750,  6.0625,  ..., -3.6250, -3.6250, -3.6250]],
       device='cuda:0')  # batch
tensor([[ 7.2812,  9.2500,  6.2188,  ..., -3.7969, -3.7969, -3.7969]],
       device='cuda:0')  # single
```
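
For context, here is a minimal sketch of the kind of comparison that produces the gap above. The model name and prompts are placeholders, not my actual setup; the point is just the batched vs. single forward pass with an explicit attention mask and `use_cache=False`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM shows the same setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float32  # same dtype for both passes
).to("cuda").eval()

prompts = ["The quick brown fox", "Hello"]  # placeholder inputs

with torch.no_grad():
    # Batched forward pass with padding and an explicit attention mask.
    batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    batch_logits = model(**batch, use_cache=False).logits

    # Single-input forward pass for the first prompt only.
    single = tokenizer(prompts[0], return_tensors="pt").to("cuda")
    single_logits = model(**single, use_cache=False).logits

# With the default right padding, the first prompt's tokens occupy the same
# positions in both passes, so the logits at its last real token should match.
seq_len = single["input_ids"].shape[1]
print(batch_logits[0, seq_len - 1])  # batch
print(single_logits[0, -1])          # single
```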
