Generation is always CPU limited

It appears that LLM generation is CPU limited regardless of my hardware or my model choice.

Regardless of how I initialize the model, the GPU is never significantly stressed (maybe 20-40% utilization at peak) while a single CPU core is pinned at ~100% whenever I call generate(). I recently switched from T4 GPUs on Google’s n1 instances to L4 GPUs on their g2 instances and get basically the same performance.
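
For reference, this is roughly how I’m sampling utilization while generation runs (torch.cuda.utilization() reports through pynvml, which has to be installed; the polling thread is just my quick hack, not anything principled):

import threading
import time
import torch

def watch_gpu(stop_event, interval=0.5):
    # Poll GPU utilization via pynvml while generate() runs in the main thread.
    while not stop_event.is_set():
        print(f"GPU utilization: {torch.cuda.utilization()}%")
        time.sleep(interval)

stop_event = threading.Event()
watcher = threading.Thread(target=watch_gpu, args=(stop_event,))
watcher.start()
# ... model.generate(...) runs here ...
stop_event.set()
watcher.join()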

This happens with a whole range of models: T5-based, Pythia, GPT-J/GPT-JT, everything. Running in half precision or full precision, or using BetterTransformer, doesn’t really affect performance (8-bit is a bit slower).
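
For concreteness, the BetterTransformer and 8-bit variants look like this on my end (a sketch; to_bettertransformer() needs optimum installed, load_in_8bit needs bitsandbytes, and model_id stands in for whichever checkpoint I’m testing):

from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"  # example checkpoint

# BetterTransformer (via optimum) -- roughly the same speed for me
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).half().eval().cuda()
model = model.to_bettertransformer()

# 8-bit via bitsandbytes -- this one is a bit slower
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")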

I am very confused, because I don’t see anyone else asking about this, yet I run into it no matter how I run these models.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-large"  # example checkpoint; same behavior with everything I try
tokenizer = AutoTokenizer.from_pretrained(model_id)

# It happens regardless of how I load the model
# model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).half().eval().cuda()

# This is how I call the generate() function
prompt = "Translate English to German: How old are you?"  # example prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
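
And this is roughly how I’m measuring it (the synchronize calls are my attempt to time actual completion rather than kernel launch; the tokens/s barely moves between the T4 and the L4):

import time
import torch

torch.cuda.synchronize()
start = time.perf_counter()
output_ids = model.generate(input_ids, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# For a seq2seq model, generate() returns only the new (decoder) tokens.
n_new = output_ids.shape[-1]
print(f"{n_new} tokens in {elapsed:.2f}s -> {n_new / elapsed:.1f} tokens/s")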

Could it be that attention is calculated on my CPU? I’ve tried profiling flan-t5-large (loaded with .half().cuda() to make sure it’s not a device_map thing), and the profile says most of the time is taken up by attention, which is classified as a cpu_op.

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output_ids = model.generate(input_ids, max_new_tokens=256)
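
I’m reading the results like this; sorted by self CPU time, the attention ops sit at the top, tagged as cpu_op:

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))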