Is there any way to avoid CPU bottlenecks when doing single prompt inference?

Following on from my thread here: Baffling performance issue on most NVidia GPUs with simple transformers + pytorch code - #3 by TheBloke

I learned that when doing single-prompt inference, for example:

# direct generate() call (model, tokenizer, input_ids, device set up beforehand)
tokens = model.generate(inputs=input_ids, generation_config=generation_config)[0].to(device)
response = tokenizer.decode(tokens)

or, equivalently, via a pipeline:

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, generation_config=generation_config, device=device)
response = pipe(prompt)[0]["generated_text"]

I am limited by single-core CPU performance. On most servers I’ve tested, I can’t get anywhere near fully utilising a 4090, for example.

I know that if I were processing many prompts, I could use the pipeline with batching, and this improves GPU utilisation and per-prompt throughput significantly.
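
For concreteness, this is the kind of batched pipeline call I mean (a minimal sketch reusing model, tokenizer, generation_config and device from above; the prompt list and batch_size=4 are just placeholders):

from transformers import pipeline

# batching needs a pad token; many causal LMs don't set one by default
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]  # placeholder prompts

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                generation_config=generation_config, device=device)

# batch_size controls how many prompts go through the model per forward pass,
# which keeps the GPU much busier than one-prompt-at-a-time calls
for out in pipe(prompts, batch_size=4):
    print(out[0]["generated_text"])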

But is there any way I can also improve single-prompt performance?

Are there any techniques that could somehow make use of multi-threading on the Python side, even for a single prompt?

My googling is coming up with nothing, which makes me think there’s no way to do this, not even a complex one. But I wanted to be sure.

Thanks in advance for any help.

I’m also running into this issue and haven’t found a solution so far. Some framework must have this, though: it must be possible to somehow use multiple CPU cores to help coordinate instructions for the GPU.

Not sure if I’ll have time, but I was thinking of looking more into the accelerate config and/or using DeepSpeed; a rough sketch of the kind of thing I mean is below.
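
In case it helps, this is roughly what I had in mind for DeepSpeed (just a sketch, not tested or benchmarked; it assumes model, input_ids, generation_config and tokenizer are set up as in the first post, and that DeepSpeed’s kernel injection supports the model architecture):

import torch
import deepspeed

# wrap the already-loaded HF model in DeepSpeed's inference engine;
# replace_with_kernel_inject swaps supported layers for fused CUDA kernels,
# which should cut down the per-token Python/CPU work needed to drive the GPU
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # single GPU, no tensor parallelism
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# the wrapped HF model is still available as .module, so generate() works as before
tokens = ds_engine.module.generate(inputs=input_ids, generation_config=generation_config)[0]
response = tokenizer.decode(tokens)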