Following on from my thread here: "Baffling performance issue on most NVidia GPUs with simple transformers + pytorch code" (#3 by TheBloke),
I learned that when doing single-prompt inference, for example:
    tokens = model.generate(inputs=input_ids.to(device), generation_config=generation_config)
    response = tokenizer.decode(tokens[0])
or
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, generation_config=generation_config, device=device)
    response = pipe(prompt)[0]["generated_text"]
I am limited by single-core CPU performance. On most servers I've tested, I can't get close to fully utilising a 4090, for example.
I know that if I were processing many prompts, I could use pipeline with batches, which improves GPU utilisation and per-prompt performance significantly.
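For reference, the batched case I mean looks roughly like the sketch below. The helper name generate_batched and the default batch size are my own placeholders, not anything from transformers. It assumes pipe is a text-generation pipeline, which accepts a list of prompts plus a batch_size argument and returns one list of candidate dicts per prompt.

```python
from typing import Callable, List, Sequence


def generate_batched(pipe: Callable, prompts: Sequence[str], batch_size: int = 8) -> List[str]:
    """Run many prompts through a text-generation pipeline in batches.

    `pipe` is assumed to behave like a transformers text-generation pipeline;
    this helper itself is hypothetical and only illustrates the call shape.
    """
    # Passing a list together with batch_size lets the pipeline group
    # several prompts into each forward pass on the GPU.
    outputs = pipe(list(prompts), batch_size=batch_size)
    # For list input, each element is a list of dicts (one per returned
    # sequence); take the first candidate for every prompt.
    return [candidates[0]["generated_text"] for candidates in outputs]
```

With a real pipeline, batching generally needs a padding token, so for models without one something like pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id is probably needed first. This only helps when there are several prompts to group, which is exactly why it does nothing for the single-prompt case.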
But is there any way I can also improve single-prompt performance?
Are there any techniques that could somehow make use of multi-threading on the Python side, even for a single prompt?
My googling is coming up with nothing, which makes me think there's no way to do this, not even a complex way. But I wanted to be sure.
Thanks in advance for any help.