Generation is always CPU limited

It appears that LLM generation is CPU-limited regardless of my hardware or my model choice.

Regardless of how I initialize the model, my GPU is never significantly stressed (maybe 20-40% max) and one of my CPU cores is pinned at ~100% when I call generate(). I recently switched from Google’s T4 GPUs on their n1 instances to L4 GPUs on their g2 instances and get basically the same performance.

This happens for a load of models: T5-based, Pythia, GPT-J(T), everything. Running in half precision or full precision, or using BetterTransformer, doesn't really affect performance (8-bit is a bit slower).

I am very confused because I don’t really see anyone asking any questions about this, yet I encounter it in every way I run these models.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# model_id and prompt are defined elsewhere
tokenizer = AutoTokenizer.from_pretrained(model_id)

# It happens regardless of how I load the model
# model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).half().eval().cuda()

# This is how I call the generate() function
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
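
To put a number on "basically the same performance" when comparing the T4 and L4, here is a rough sketch of a timing helper for a generate() call. The name `time_generate` and its parameters are hypothetical and not part of any library. It uses only the standard library and takes the synchronization function as an argument, on the assumption that on a CUDA device you would pass `torch.cuda.synchronize` so queued GPU work is included in the measurement.

```python
import time


def time_generate(generate_fn, n_new_tokens, sync=None, warmup=1, repeats=3):
    """Return (mean seconds per call, tokens per second) for generate_fn.

    generate_fn:  zero-argument callable wrapping the generate() call.
    n_new_tokens: tokens each call is expected to produce (an upper bound
                  if generation can stop early at an end-of-sequence token).
    sync:         optional callable that blocks until queued device work is
                  finished, e.g. torch.cuda.synchronize (assumption: CUDA).
    """
    if sync is None:
        sync = lambda: None  # nothing to wait for on CPU-only runs

    # Warm-up calls are not timed, so one-off setup costs are excluded.
    for _ in range(warmup):
        generate_fn()
    sync()

    start = time.perf_counter()
    for _ in range(repeats):
        generate_fn()
    sync()  # wait for the device before stopping the clock
    elapsed = (time.perf_counter() - start) / repeats

    return elapsed, n_new_tokens / elapsed


# Possible usage with the snippet above (requires torch and a GPU):
# import torch
# secs, tps = time_generate(
#     lambda: model.generate(input_ids, max_new_tokens=256),
#     n_new_tokens=256,
#     sync=torch.cuda.synchronize,
# )
# print(f"{secs:.2f} s per call, {tps:.1f} tokens/s")
```

If tokens per second barely changes between GPUs while one CPU core stays at ~100%, that is consistent with the observation above, though it does not by itself say where the CPU time goes.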

Could it be that attention is calculated on my CPU? I’ve tried profiling flan-t5-large (using .half().cpu() to make sure it’s not a device_map thing) and the profile says most time is taken up by the attention, which is classified as a cpu_op.

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof: