I think I figured it out. It’s depressingly mundane.
It’s bottlenecked on CPU.
The systems that perform better apparently just have much stronger single-core CPU performance. The one Runpod system I found that performs well has an Intel i9-13900K, which very likely achieves far higher single-core performance than the high-core-count AMD CPUs on the other servers I’ve tested.
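(If anyone wants to sanity-check this on their own machine, here’s a crude single-thread benchmark sketch. It’s pure Python, so it roughly tracks the single-core speed that the Python-side generation loop depends on; compare the numbers across machines rather than reading them absolutely.)

```python
import time

def busy_loop(n: int = 20_000_000) -> float:
    """Pure-Python busy loop: a rough proxy for single-thread CPU speed."""
    start = time.perf_counter()
    x = 0
    for i in range(n):
        x += i
    return time.perf_counter() - start

# Lower is better; an i9-13900K should beat a high-core-count server CPU here.
print(f"busy loop: {busy_loop():.2f}s")
```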
I don’t suppose there’s any way to utilise CPU multi-threading in a simple call like `output = model.generate(...)`? I’m assuming not.
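The closest knob I know of is `torch.set_num_threads`, but as far as I understand it only controls intra-op parallelism for CPU tensor ops; the Python loop inside `generate()` that launches GPU kernels still runs on a single thread, so I doubt it helps here. For completeness:

```python
import torch

# Only affects intra-op parallelism of CPU-side tensor ops; the Python
# generation loop driving the GPU remains single-threaded regardless.
torch.set_num_threads(8)  # call once, before any generate() calls
print(torch.get_num_threads())
```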
Or is there any way to prepare the model for more efficient execution, in terms of CPU usage and data transfer between CPU and GPU?
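One thing I haven’t tried yet that’s supposed to target exactly this kind of per-token CPU overhead: compiling the forward pass with a static KV cache, so the decode step can be captured (reduce-overhead mode uses CUDA graphs under the hood). Untested sketch, assuming a recent transformers and PyTorch 2.x; the model ID is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Static KV cache avoids re-allocating the cache every step, which lets
# torch.compile capture the decode step instead of re-tracing it.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
# First call pays the compilation cost; subsequent calls should be faster.
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```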
I’m sure batching prompts would help throughput, but in this particular case I’m after single-prompt latency.