I’m building a chatbot using a transformer model (e.g., GPT-2 or BlenderBot) and I would like to run it on a server (Windows or Linux). The server has one 11 GB GPU. As long as only one inference of the chatbot model runs at a time, there is no problem. But if there are several concurrent calls, they have to be executed sequentially, which increases the response time. For example, if one inference takes 3 seconds and there are 10 concurrent calls, it takes 30 seconds until the last call is processed. In theory, the concurrent calls could be batched for inference, but calls usually do not arrive at exactly the same time.
Is there a solution to this problem for concurrent inference on a single GPU?
Does anybody have suggestions? I’d appreciate any input.
Just some thoughts. They might or might not be helpful.
Set a parameter that controls the interval between inference calls, in seconds. Let incoming requests accumulate until the next scheduled inference call. When that time comes, group everything that has arrived so far into one batch and run inference on that batch. If nothing is in the queue, just wait until the next scheduled inference time. Naively, one might expect that if you set this interval to, say, 4 seconds, it gives the GPU enough time to process the previous batch and be ready for your next call. This way everybody gets a response in no more than 7 seconds (at most 4 seconds of waiting plus 3 seconds of inference). Reality could be different, but this is something to play with. Hope it helps.
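To make the idea above concrete, here is a minimal sketch in Python using only the standard library. The names IntervalBatcher, infer_batch and submit are made up for illustration; infer_batch stands in for whatever function runs your model on a list of inputs and returns one output per input. I have not benchmarked this on a real model, so treat it as a starting point only.

```python
import queue
import threading
import time
from concurrent.futures import Future


class IntervalBatcher:
    """Runs everything that accumulated as one batch every `interval` seconds."""

    def __init__(self, infer_batch, interval=4.0):
        # infer_batch is a hypothetical callable: list of inputs -> list of outputs.
        self.infer_batch = infer_batch
        self.interval = interval
        self.pending = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def submit(self, text):
        """Called once per incoming request; returns a Future for the reply."""
        fut = Future()
        self.pending.put((text, fut))
        return fut

    def _drain(self):
        """Take everything currently in the queue without blocking."""
        items = []
        while True:
            try:
                items.append(self.pending.get_nowait())
            except queue.Empty:
                return items

    def _loop(self):
        next_run = time.monotonic() + self.interval
        # Event.wait returns False on timeout, True once close() was called.
        while not self._stop.wait(max(0.0, next_run - time.monotonic())):
            next_run += self.interval  # fixed schedule, independent of inference time
            items = self._drain()
            if not items:
                continue  # nothing queued, wait for the next slot
            try:
                outputs = self.infer_batch([text for text, _ in items])
                for (_, fut), out in zip(items, outputs):
                    fut.set_result(out)
            except Exception as exc:
                # Pass the failure on to every caller in this batch.
                for _, fut in items:
                    fut.set_exception(exc)

    def close(self):
        self._stop.set()
        self._worker.join()
```

Each request handler would call batcher.submit(text).result() and block until its batch has been processed. If inference ever takes longer than the interval, the loop simply runs batches back to back until it catches up.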
Or you can check every second whether the GPU is busy (I actually don’t know how to do that; I expect it to be super easy, but I have never done it). If it is not busy, you run inference on the accumulated batch; if it is, you wait another second and check again. This way the worst case is the same 7 seconds, but you do not make people wait unnecessarily when the load is light.
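One way to get this behaviour without actually querying the GPU: if a single worker thread is the only thing that ever runs the model, then the GPU is free exactly when that thread is idle. The sketch below swaps the once-per-second polling for a blocking queue, so it is a variation on the idea and not a literal implementation of it. GreedyBatcher, infer_batch and max_batch are hypothetical names; max_batch is an assumed cap so that a burst of requests does not exceed GPU memory.

```python
import queue
import threading
from concurrent.futures import Future


class GreedyBatcher:
    """Runs a batch as soon as the single worker is free and requests are waiting."""

    def __init__(self, infer_batch, max_batch=8):
        # infer_batch is a hypothetical callable: list of inputs -> list of outputs.
        self.infer_batch = infer_batch
        self.max_batch = max_batch
        self.pending = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def submit(self, text):
        """Called once per incoming request; returns a Future for the reply."""
        fut = Future()
        self.pending.put((text, fut))
        return fut

    def _loop(self):
        while True:
            first = self.pending.get()  # blocks while there is nothing to do
            if first is None:           # sentinel placed by close()
                return
            items = [first]
            stop = False
            # Add whatever else piled up while the previous batch was running.
            while len(items) < self.max_batch:
                try:
                    nxt = self.pending.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                items.append(nxt)
            try:
                outputs = self.infer_batch([text for text, _ in items])
                for (_, fut), out in zip(items, outputs):
                    fut.set_result(out)
            except Exception as exc:
                # Pass the failure on to every caller in this batch.
                for _, fut in items:
                    fut.set_exception(exc)
            if stop:
                return

    def close(self):
        self.pending.put(None)
        self._worker.join()
```

Under light load a lone request is processed immediately as a batch of one; under heavy load the batches grow on their own, because requests queue up while the previous batch is still running.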