Concurrent inference on a single GPU

Eichhof · November 18, 2021, 10:44pm

Hello

I’m building a chatbot using a transformer model (e.g., GPT-2 or BlenderBot) and I would like to run it on a server (Windows or Linux). The server has one 11GB GPU. As long as there is only one inference call at a time, there is no problem. But if there are several concurrent calls, they have to be executed sequentially, which increases the response time. For example, if one inference takes 3 seconds and we have 10 concurrent calls, it takes 30 seconds until the last call is processed. Theoretically, the concurrent calls could be batched for inference, but calls usually do not arrive at exactly the same time.

Is there a solution to this problem for concurrent inference on a single GPU?

Eichhof · November 22, 2021, 12:08am

Does anybody have any suggestions? I’d appreciate any input.

alexgrishin · November 28, 2021, 3:24am

Just some thoughts. They might or might not be helpful.

Set a parameter that controls the interval between inference calls, in seconds. Let incoming requests accumulate until the next scheduled inference call. When that time comes, group everything that has arrived so far into one batch and run inference on that batch. If nothing is in the queue, just wait until the next scheduled inference time. Naively, if you set this interval to, let’s say, 4 seconds, the GPU should have enough time to finish the previous batch and be ready for the next call. This way everybody gets a response in no more than about 7 seconds (up to 4 seconds of waiting plus roughly 3 seconds of inference, assuming a batch takes about as long as a single call). Reality could be different, but this is something to play with. Hope it helps.
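A minimal sketch of the scheduled batching idea above, using Python’s `asyncio`. Everything here is an assumption for illustration: `IntervalBatcher`, `fake_batch_inference`, and the short demo interval are hypothetical names and values, and the fake function stands in for a real batched model call (for example, a batched `generate()`), which is not shown.

```python
import asyncio
import time


def fake_batch_inference(prompts):
    # Hypothetical stand-in for a real batched model call on the GPU.
    time.sleep(0.01)
    return [f"reply to: {p}" for p in prompts]


class IntervalBatcher:
    """Collects requests and runs one batched inference per interval."""

    def __init__(self, infer_fn, interval_s=4.0):
        self.infer_fn = infer_fn
        self.interval_s = interval_s
        self.pending = []      # (prompt, future) pairs waiting for the next batch
        self.batch_sizes = []  # kept only so the demo can show what was batched

    async def submit(self, prompt):
        # Called once per incoming request; resolves when its batch is done.
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # The sleep starts after the previous batch finished, so the
            # effective period is interval_s plus the inference time.
            await asyncio.sleep(self.interval_s)
            if not self.pending:
                continue  # nothing queued, wait for the next slot
            batch, self.pending = self.pending, []
            self.batch_sizes.append(len(batch))
            prompts = [p for p, _ in batch]
            # Run the blocking model call in a thread so the event loop
            # can keep accepting new requests in the meantime.
            replies = await loop.run_in_executor(None, self.infer_fn, prompts)
            for (_, fut), reply in zip(batch, replies):
                fut.set_result(reply)


async def demo():
    # Short interval so the demo finishes quickly; the post suggests ~4 s.
    batcher = IntervalBatcher(fake_batch_inference, interval_s=0.05)
    worker = asyncio.create_task(batcher.run())
    replies = await asyncio.gather(*(batcher.submit(f"msg {i}") for i in range(3)))
    worker.cancel()
    return replies, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real server, each request handler would call `submit()` and the `run()` task would be started once at startup. Whether a batch really takes about as long as a single call depends on the model, the sequence lengths, and the 11GB memory limit, so the interval would need tuning.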

alexgrishin · November 28, 2021, 3:38am

Or you can check every second whether the GPU is busy (I actually don’t know how to do that; I expect it to be easy, but I have never done it). If it is not busy, run inference on the accumulated batch; if it is, wait another second and check again. This way the worst case is still about 7 seconds, but people don’t have to wait unnecessarily when the load is light.
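One possible way to get this behaviour without querying the GPU at all, sketched below under stated assumptions: if a single worker is the only code that ever runs the model, then the worker being idle is the "GPU is free" signal. This swaps the one-second polling for a queue that the worker drains as soon as the previous batch finishes. The names `worker`, `ask`, `max_batch`, and `fake_batch_inference` are hypothetical, and the fake function stands in for the real batched model call.

```python
import asyncio
import time


def fake_batch_inference(prompts):
    # Hypothetical stand-in for a real batched model call on the GPU.
    time.sleep(0.05)
    return [p.upper() for p in prompts]


async def worker(queue, infer_fn, batch_sizes, max_batch=8):
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one request exists (no waiting when idle).
        items = [await queue.get()]
        # Take everything else that piled up while the last batch ran.
        while len(items) < max_batch and not queue.empty():
            items.append(queue.get_nowait())
        batch_sizes.append(len(items))
        prompts = [p for p, _ in items]
        # Blocking model call runs in a thread; new requests keep queuing.
        replies = await loop.run_in_executor(None, infer_fn, prompts)
        for (_, fut), reply in zip(items, replies):
            fut.set_result(reply)


async def ask(queue, prompt):
    # Called once per incoming request; resolves when its batch is done.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut


async def demo():
    queue = asyncio.Queue()
    batch_sizes = []
    task = asyncio.create_task(worker(queue, fake_batch_inference, batch_sizes))
    first = asyncio.create_task(ask(queue, "a"))
    await asyncio.sleep(0.01)  # "a" is already being processed on its own
    rest = await asyncio.gather(ask(queue, "b"), ask(queue, "c"))
    task.cancel()
    return await first, rest, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Under light load a request is processed immediately as a batch of one; under heavy load the requests that arrive during an inference are grouped into the next batch. `max_batch` is there only to keep a batch within GPU memory and would need tuning for the actual model.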
