Baffling performance issue on most NVidia GPUs with simple transformers + pytorch code

Hi all

I am hoping that someone can help me get to the bottom of a perplexing performance problem that I’ve discovered while benchmarking language model inference using transformers + pytorch 2.0.0.

I was testing float16 inference on pytorch.bin format models, as well as 4bit quantisation with GPTQ. I don’t own an NV GPU myself, so I was using a cloud GPU provider (Runpod) with a 4090.

Long story short, I posted my float16 and int4 benchmarks, thinking they were fine. Someone told me that the performance seemed much too low, so I did some more digging and wrote this little test script:

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

pretrained_model_dir = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
print("Loading model")
model = AutoModelForCausalLM.from_pretrained(pretrained_model_dir).eval().to("cuda:0")
print("Model loaded")

num_runs = 20
results = []

def inference():
    input_text = "The benefits of deadlifting are:"
    # Tokenize the prompt and move the token ids to the GPU
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
    start_time = time.time()
    with torch.no_grad():
        out = model.generate(input_ids=input_ids, max_length=496)
    return time.time() - start_time

# Run one extra time: the first (warm-up) run is discarded
for i in range(0, num_runs + 1):
    duration = inference()
    if i == 0:
        print("Discarding first run")
    else:
        print(f"{i:2} run time: {duration:.4f} s")
        results.append(duration)

average = sum(results) / len(results)
print(f"Average over {num_runs} runs: {average:.4f} s")

Running this test script on most 4090s I’ve tried - on Runpod, on a second cloud provider, and also on a friend’s 4090 - gives a result like this:
Average over 20 runs: 3.3191 s

But on a small number of the 4090 systems I’ve tried, I get a result like:
Average over 20 runs: 1.0022 s

About 3.3 times faster! I believe this result of 1.00 s is the correct one for this GPU.

The results are correlated with very different GPU usage % figures. The well-performing systems will use 60+% GPU in this small test, while a badly performing one uses only around 20%.

This small benchmark is indicative of the actual LLM inference performance of the systems. For example, testing on 7B Llama models, the well-performing 4090 systems can do float16 inference at 50 tokens/s, vs 23 tokens/s for the poor ones. For 4bit GPTQ inference, the good systems achieve 95 tokens/s, vs 28 tokens/s on the badly performing systems - a huge difference!
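To compare these throughput figures with per-step overheads, it can help to convert them to time per token. Here is a minimal sketch of that arithmetic, using only the figures quoted above; the helper name `ms_per_token` is just for illustration:

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/s to a latency in milliseconds per token."""
    return 1000.0 / tokens_per_second

# Figures quoted above for 7B Llama on 4090 systems: (good, poor) in tokens/s.
throughputs = {
    "float16": (50, 23),
    "4bit GPTQ": (95, 28),
}

for name, (good, poor) in throughputs.items():
    print(
        f"{name}: {ms_per_token(good):.1f} ms/token (good) vs "
        f"{ms_per_token(poor):.1f} ms/token (poor), "
        f"{good / poor:.2f}x throughput difference"
    )
```

In these terms the poor systems spend roughly 43 ms per token in float16 against 20 ms on the good ones, and the gap is wider still for GPTQ.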

I also tried my test script on two other GPU types - A4000 and 3090. Amazingly, they performed almost exactly the same as the bad 4090 systems:

  • 3090: Average over 20 runs: 3.5412 s
  • A4000: Average over 20 runs: 3.4645 s

So whatever this bottleneck is, it’s limiting 4090s to the same performance as an A4000! Probably even worse than that, as I’ve not yet tested any GPUs weaker than an A4000.
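For reference, the slowdown of each result relative to the fast 4090 can be worked out directly from the averages reported above. This is only a sketch of the arithmetic; the labels in the dictionary are mine:

```python
# Averages reported above, in seconds.
baseline = 1.0022  # the well-performing 4090

averages = {
    "4090 (typical)": 3.3191,
    "3090": 3.5412,
    "A4000": 3.4645,
}

# Ratio of each average to the fast 4090's average.
slowdowns = {name: seconds / baseline for name, seconds in averages.items()}

for name, ratio in slowdowns.items():
    print(f"{name}: {ratio:.2f}x slower than the fast 4090")
```

All three land between about 3.3x and 3.5x, even though the GPUs themselves differ widely in raw speed.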

All my tests have been run with the same Ubuntu 20.04 docker image with torch 2.0.0+cu117, so there are no OS or CUDA toolkit differences there. I’ve also had test results from two colleagues who own 4090s, who both tested on Windows and likewise got much worse results than the 1.0 s best.

I have checked:

  • OS: as mentioned all my results, good and bad, are with an identical docker image. I’ve also had a friend try it on his 4090 in Windows and he got a bad result too (actually slightly worse, 3.8s in the test script)
  • CUDA toolkit version: tested CUDA 11.6 and 11.7, no correlation between toolkit version and good or bad performance
  • NV driver version - I’ve seen the same NV driver version on good and bad performing systems
  • PCIe link width - the good 4090 result shown above was PCIe x8 and most of the bad ones were x16!
  • GPU max power - all are set to 450W
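For anyone who wants to gather the driver, PCIe and power-limit items from this checklist on their own system, here is a rough sketch that reads them from `nvidia-smi --query-gpu`. The function names are mine, and it assumes `nvidia-smi` is on the PATH (it returns an empty list if not). Note that the reported current PCIe link generation may be lower while the GPU is idle, so it is probably best checked under load.

```python
import csv
import io
import shutil
import subprocess

# Fields accepted by `nvidia-smi --query-gpu`, matching the checklist above.
FIELDS = [
    "name",
    "driver_version",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "power.limit",
]

def parse_gpu_csv(text: str) -> list:
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]

def query_gpus() -> list:
    """Run nvidia-smi if available; return an empty list if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Running this on a good and a bad system and comparing the printed dictionaries side by side covers the same ground as the manual checks in the list.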

Here’s a comparison of the output of nvidia-smi --query on two systems, one that performed well (in green) and one that performed poorly (in red):

(if it’s too small to read here’s the full sized image: Imgur)

The only difference that jumped out at me is that the good system has double the TX and RX throughput. But that could be a symptom of the problem, not the cause. Maybe the host system is passing it data far slower, or something.

(In this example they do have different NV drivers, but I’ve found several poorly performing systems with the same NV driver as the good performing one.)

I am baffled by this. And it seems to me it has major implications - there are potentially thousands of people out there doing local LLM inference who are getting a fraction of the performance they should be getting.

Any thoughts would be hugely appreciated!

I did an audit of cloud systems with 4090 GPUs, using the provider’s tag system to denote each system’s performance on my test script:

Three out of the five 4090s scored the bad ~3.5s average in my test script, vs one system that got 1.0s and another at 1.3s.

All were using the same Docker image, and all are the same GPU, of course. You can see their host details: most are PCIe 4.0 x16, but one of the good ones is PCIe x8. There is a mixture of AMD and Intel systems, with various RAM and CPU sizes. I can see no obvious pattern at all.

On Runpod I only found one system out of 7 that got the 1.0s score. So in total, out of 13 systems audited (12 x cloud GPUs running Ubuntu in Docker + 1 x Windows home PC), only three have performed well. The other ten were roughly 3.5 to 3.8 times slower than that.

This is so confusing and weird.

I think I figured it out. It’s depressingly mundane.

It’s bottlenecked on CPU.

The systems that perform better apparently just have much better single-core CPU performance. The one Runpod system I found that performs so well has an Intel i9-13900K, and it seems very likely that it’s able to achieve much higher single-core CPU performance vs the high-core-count AMD CPUs on the other servers I’ve tested.
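One way to picture why a faster GPU doesn't help here: as far as I understand it, PyTorch issues CUDA kernel launches from a single CPU thread and they run asynchronously on the GPU, so each generation step takes roughly as long as the slower of the two sides. The toy model below illustrates that idea. The per-step timings are made-up illustrative values, not measurements, and `estimate_generate_seconds` is a hypothetical helper, not anything from transformers or pytorch.

```python
def estimate_generate_seconds(steps: int, cpu_ms_per_step: float, gpu_ms_per_step: float) -> float:
    """Toy model: each decoding step costs the slower of CPU launch work and GPU compute."""
    return steps * max(cpu_ms_per_step, gpu_ms_per_step) / 1000.0

STEPS = 500  # illustrative number of decoding steps

# Hypothetical per-step costs in milliseconds (NOT measured).
cpus = {"slow CPU": 7.0, "fast CPU": 2.0}
gpus = {"fast GPU": 1.0, "slower GPU": 1.8}

for cpu_name, cpu_ms in cpus.items():
    for gpu_name, gpu_ms in gpus.items():
        seconds = estimate_generate_seconds(STEPS, cpu_ms, gpu_ms)
        print(f"{cpu_name} + {gpu_name}: {seconds:.2f} s")
```

In this model, as long as the CPU side is the larger term, swapping the GPU changes nothing and only a faster single core shortens the run, which would match the 4090, 3090 and A4000 all landing at about the same time.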

I don’t suppose there’s any way to utilise CPU multi-threading in a simple call like:
output = model.generate(...) ? I’m assuming not.

Or any way to prepare the model for more efficient usage in terms of CPU and data transfer between CPU and GPU?

I’m sure batching prompts could help, but in this particular case I’m looking for single-prompt performance.
