Baffling performance issue on most NVidia GPUs with simple transformers + pytorch code

TheBloke · May 9, 2023, 11:37pm

I think I figured it out. It’s depressingly mundane.

It’s bottlenecked on CPU.

The systems that perform better apparently just have much better single-core CPU performance. The one Runpod system I found that performs so well has an Intel i9-13900K, and it seems very likely that it’s able to achieve much higher single-core CPU performance vs the high-core-count AMD CPUs on the other servers I’ve tested.

I don’t suppose there’s any way to utilise CPU multi-threading in a simple call like:
output = model.generate(...) ? I’m assuming not.

Or any way to prepare the model for more efficient usage in terms of CPU and data transfer between CPU and GPU?

I’m sure batching prompts could help, but in this particular case I’m looking for single-prompt performance.

Topic		Replies	Views
[Help] GPU with query answering 🤗Transformers	0	333	November 25, 2020
Is there any way to avoid CPU bottlenecks when doing single prompt inference? Intermediate	1	1011	June 12, 2023
GPU inference slows down if done in a loop 🤗Transformers	1	1591	July 20, 2020
Run_ner.py slower on multi-GPU than single GPU Beginners	1	1810	September 23, 2020
Bfloat16 conversion results in significantly slower computation for various transformer models 🤗Transformers	0	1448	December 20, 2021

Baffling performance issue on most NVidia GPUs with simple transformers + pytorch code

Related topics