Hi all
I am hoping that someone can help me get to the bottom of a perplexing performance problem that I’ve discovered while benchmarking language model inference using transformers + pytorch 2.0.0.
I was testing float16 inference on standard pytorch_model.bin checkpoints, as well as 4-bit quantisation with GPTQ. I don’t own an NVIDIA GPU myself, so I was using a cloud GPU provider (Runpod) with a 4090.
Long story short, I posted my float16 and int4 benchmarks, thinking they were fine. Someone told me the performance seemed much too low. So I did some more digging and wrote this little test script:
```python
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

pretrained_model_dir = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

print("Loading model")
model = AutoModelForCausalLM.from_pretrained(pretrained_model_dir).eval().to("cuda:0")
print("Model loaded")

num_runs = 20
results = []

def inference():
    input_text = "The benefits of deadlifting are:"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
    start_time = time.time()
    with torch.no_grad():
        out = model.generate(input_ids=input_ids, max_length=496)
    return time.time() - start_time

for i in range(0, num_runs + 1):
    duration = inference()
    if i == 0:
        print("Discarding first run")
    else:
        print(f"{i:2} run time: {duration:.4f} s")
        results.append(duration)

average = sum(results) / len(results)
print(f"Average over {num_runs} runs: {average:.4f} s")
```
Running this test script on most 4090s I’ve tried - on Runpod, on Vast.ai, and also on a friend’s 4090 - gives a result like this:
Average over 20 runs: 3.3191 s
But on a small number of 4090 systems I’ve tried, I get a result like:
Average over 20 runs: 1.0022 s
Over 3.3 times faster! I believe this ~1.00 s result is the correct one for this GPU.
The results correlate with very different GPU utilisation figures: the well-performing systems sit at 60%+ GPU utilisation during this small test, while the badly performing ones hover around only 20%.
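If anyone wants to reproduce the utilisation comparison, something like this should work for polling the GPU while the benchmark loops in another terminal (it just reads the standard `utilization.gpu` query field; the one-second interval is arbitrary):

```python
import subprocess
import time

# Poll GPU utilisation once a second while the benchmark runs in another terminal.
# utilization.gpu is a standard nvidia-smi --query-gpu field; Ctrl-C to stop.
while True:
    util = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(f"GPU utilisation: {util}%")
    time.sleep(1)
```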
This small benchmark is indicative of the actual LLM inference performance of the systems. For example, testing on 7B Llama models, the well-performing 4090 systems can do float16 inference at 50 tokens/s, vs 23 tokens/s for the poor ones. For 4-bit GPTQ inference, the good systems achieve 95 tokens/s, vs 28 tokens/s on the badly performing systems - a huge difference!
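The tokens/s figures are just generated tokens divided by wall-clock time. If you want the same measure out of the little test script above, something like this tweak to inference() should work (token count taken from the shape of the generate output):

```python
def inference_tok_per_s():
    # Same generate call as in the test script above (reuses its tokenizer and model),
    # but reports tokens generated per second instead of wall-clock seconds.
    input_text = "The benefits of deadlifting are:"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
    start_time = time.time()
    with torch.no_grad():
        out = model.generate(input_ids=input_ids, max_length=496)
    duration = time.time() - start_time
    new_tokens = out.shape[1] - input_ids.shape[1]  # generation may stop early at EOS
    return new_tokens / duration
```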
I also tried my test script on two other GPU types, an A4000 and a 3090. Amazingly, they performed essentially the same as the badly performing 4090 systems:
- 3090:
Average over 20 runs: 3.5412 s
- A4000:
Average over 20 runs: 3.4645 s
So whatever this bottleneck is, it’s limiting 4090s to the same performance as an A4000! Probably even worse than that, as I’ve not yet tested any GPUs weaker than an A4000.
All my tests have been run with the same Ubuntu 20.04 Docker image with torch 2.0.0+cu117, so there are no OS or CUDA toolkit differences there. I’ve also had test results from two colleagues who own 4090s, who both tested on Windows and likewise got much worse results than the 1.0 s best.
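If anyone wants to double-check they’re comparing like for like, a quick way to print the relevant stack versions is something like:

```python
import torch

# These should match across systems if the Docker image really is identical.
print("torch:", torch.__version__)                 # e.g. 2.0.0+cu117
print("CUDA (torch built against):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
```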
I have checked the following (one way to query the driver, PCIe, and power values is sketched after this list):
- OS: as mentioned, all my results, good and bad, are with an identical Docker image. I’ve also had a friend try it on his 4090 on Windows, and he got a bad result too (actually slightly worse: 3.8 s in the test script)
- CUDA toolkit version: I’ve tested CUDA 11.6 and 11.7, and there’s no correlation between toolkit version and good or bad performance
- NVIDIA driver version: I’ve seen the same driver version on both good and badly performing systems
- PCIe link width: the good 4090 result shown above was on PCIe x8, while most of the bad ones were x16!
- GPU max power limit: all are set to 450 W
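For anyone who wants to compare their own systems, something like this should pull the driver, PCIe link, and power-limit values in one go (these are all standard nvidia-smi --query-gpu fields):

```python
import subprocess

# Driver version, current/max PCIe link gen and width, and power limit in one query.
# All field names are standard nvidia-smi --query-gpu properties.
fields = "driver_version,pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max,power.limit"
print(subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
).stdout)
```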
Here’s a comparison of the output of `nvidia-smi --query` on two systems, one that performed well (in green) and one that performed poorly (in red):
(If it’s too small to read, here’s the full-sized image: Imgur)
The only difference that jumped out at me is that the good system shows double the TX and RX throughput. But that could be a symptom of the problem rather than the cause; maybe the host system is feeding it data far more slowly, or something.
(In this example they do have different NVIDIA drivers, but I’ve found several poorly performing systems with the same driver as the well-performing one.)
I am baffled by this. And it seems to me it has major implications - there are potentially thousands of people out there doing local LLM inference who are getting a fraction of the performance they should be getting.
Any thoughts would be hugely appreciated!