Is there any way to avoid CPU bottlenecks when doing single prompt inference?

TheBloke · May 10, 2023, 1:53pm

Following on from my earlier thread, "Baffling performance issue on most NVidia GPUs with simple transformers + pytorch code" (post #3).

I learned that when doing single-prompt inference, for example:

tokens = model.generate(inputs=input_ids, generation_config=generation_config)[0].to(device)
response = tokenizer.decode(tokens)

or

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, generation_config=generation_config, device=device)
response = pipe(prompt)[0]["generated_text"]

I am limited by single-core CPU performance. On most servers I’ve tested, I can’t get close to fully utilising a GPU such as a 4090.
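One rough way to put a number on this is to measure generated tokens per second for a single prompt and compare it across machines. The helper below is only a sketch: `tokens_per_second` and `generate_fn` are hypothetical names, and `generate_fn` stands for any zero-argument callable that runs one generation (for example a wrapper around `model.generate`) and returns the number of new tokens it produced.

```python
import time
from typing import Callable


def tokens_per_second(generate_fn: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Return the mean generation throughput in tokens per second.

    generate_fn is a hypothetical zero-argument callable that performs one
    generation and returns how many new tokens it produced.
    """
    # Warm-up runs are not timed, so one-off start-up costs are excluded.
    for _ in range(warmup):
        generate_fn()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start

    return total_tokens / elapsed
```

When timing GPU work, the callable should make sure the work has actually finished before it returns, for example by calling `torch.cuda.synchronize()`. If the resulting figure follows the host's single-core speed rather than the GPU model, that would be consistent with a CPU-side bottleneck.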

I know that if I were processing many prompts, I could use pipeline with batches, and this improves GPU utilisation and per-prompt performance significantly.
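For reference, the batched case can be sketched as below. This assumes `pipe` is a `text-generation` pipeline like the one above, called with a list of prompts and a `batch_size` argument; `generate_batched` is a hypothetical helper name, and the default batch size of 8 is an arbitrary placeholder.

```python
from typing import Any, Callable, List, Sequence


def generate_batched(pipe: Callable[..., Any], prompts: Sequence[str], batch_size: int = 8) -> List[str]:
    """Run all prompts through a text-generation pipeline in batches."""
    # Passing a list together with batch_size lets the pipeline group the
    # prompts into batches; each prompt comes back as a list of candidates.
    outputs = pipe(list(prompts), batch_size=batch_size)
    # Keep the first candidate for each prompt, as in the single-prompt case.
    return [candidates[0]["generated_text"] for candidates in outputs]
```

Batching needs a padding token on the tokenizer. Many decoder-only models ship without one, so something like `tokenizer.pad_token = tokenizer.eos_token` may be required first. This helps throughput across many prompts, but it does not address the single-prompt case asked about here.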

But is there any way I can also improve single prompt performance?

Are there any techniques that could somehow make use of multi-threading on the Python side, even for a single prompt?

My googling is coming up with nothing, which makes me think there’s no way to do this, not even a complex one. But I wanted to be sure.

Thanks in advance for any help.

speedlemur · June 12, 2023, 12:59pm

I’m also running into this issue and have not found a solution so far. Some framework must support this, though: it must be possible to somehow use multiple CPU cores to help coordinate instructions for the GPU.

I’m not sure I will have time, but I was thinking of looking further into the accelerate config and/or using DeepSpeed.
