It appears that LLM inference is CPU-limited regardless of my hardware or my model choice.
Regardless of how I initialize the model, my GPU is never significantly stressed (maybe 20-40% utilization at most) and one of my CPU cores is pinned at ~100% when I call generate(). I recently switched from Google's T4 GPUs on their n1 instances to L4 GPUs on their g2 instances and get basically the same performance.
This happens with a whole range of models: T5-based, Pythia, GPT-J(T), everything. Running in half precision or full precision, or using BetterTransformer, doesn't really affect performance (8-bit is a bit slower).
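For concreteness, these are the kinds of loading variants I mean (a rough sketch rather than my exact scripts; google/flan-t5-large stands in for whichever checkpoint, and the BetterTransformer and 8-bit paths assume optimum and bitsandbytes are installed):

from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"  # same behaviour with Pythia, GPT-J, etc.

# full precision on the GPU
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).eval().cuda()

# half precision
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).half().eval().cuda()

# BetterTransformer (needs optimum installed)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).half().eval().cuda().to_bettertransformer()

# 8-bit (needs bitsandbytes; this one is actually a bit slower)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)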
I am very confused, because I don't really see anyone else asking about this, yet I run into it every way I run these models.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

# It happens regardless of how I load the model:
# model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).half().eval().cuda()

# This is how I call the generate() function
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
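For reference, when I say the T4 and L4 give basically the same performance, I mean the wall-clock time of a generate() call, measured roughly like this (a minimal timing sketch, not my actual benchmark):

import time
import torch

torch.cuda.synchronize()  # make sure nothing is still queued on the GPU
start = time.perf_counter()
output_ids = model.generate(input_ids, max_new_tokens=256)
torch.cuda.synchronize()  # wait for GPU work to finish before stopping the clock
print(f"generate() took {time.perf_counter() - start:.2f}s")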
Could it be that attention is being calculated on my CPU? I've tried profiling flan-t5-large (using .half().cuda() to make sure it's not a device_map thing), and the profile says most of the time is taken up by attention, which is classified as a CPU op. This is how I profiled it:
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_model()  # run_model() wraps the generate() call from above
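And this is roughly how I read the result (the sort key and row limit are just what I happened to use):

# per-operator summary; sorting by total CPU time puts the attention ops at the top
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))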