I’m building a chatbot using a transformer model (e.g., GPT-2 or BlenderBot) and would like to run it on a server (Windows or Linux). The server has a single 11 GB GPU. As long as only one inference request is running at a time, there is no problem. But when several calls arrive concurrently, they have to be executed sequentially, which increases the latency of the later calls. For example, if one inference takes 3 seconds and there are 10 concurrent calls, the last call only finishes after 30 seconds (10 × 3 s). In theory, the concurrent calls could be batched for inference, but calls usually do not arrive at exactly the same time.
Is there a solution to this problem for concurrent inference on a single GPU?
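To make the batching idea concrete, here is a minimal sketch of what I have in mind: requests are collected for a short time window and then processed as one batch, so they don't need to arrive at exactly the same moment. All names (`MicroBatcher`, `fake_generate`, `max_batch_size`, `max_wait_s`) are hypothetical, and `fake_generate` is only a stand-in for a real batched model call on the GPU. I have not verified that this is the right approach.

```python
import asyncio
import time
from typing import Callable, List


def fake_generate(prompts: List[str]) -> List[str]:
    """Stand-in (assumption) for one batched forward pass on the GPU."""
    time.sleep(0.1)  # pretend the whole batch costs one inference
    return [f"reply to: {p}" for p in prompts]


class MicroBatcher:
    """Collects requests for a short window and runs them as one batch."""

    def __init__(self, batch_fn: Callable[[List[str]], List[str]],
                 max_batch_size: int = 8, max_wait_s: float = 0.05):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # kept only for inspection

    async def submit(self, prompt: str) -> str:
        """Called once per incoming request; waits for that request's reply."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self) -> None:
        """Single worker loop: the only place that calls the model."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request is waiting.
            items = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the window closes.
            while len(items) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(
                        await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            prompts = [prompt for prompt, _ in items]
            # Run the blocking model call in a thread so the event loop
            # can keep accepting new requests in the meantime.
            replies = await loop.run_in_executor(None, self.batch_fn, prompts)
            self.batch_sizes.append(len(items))
            for (_, future), reply in zip(items, replies):
                if not future.done():
                    future.set_result(reply)


async def demo():
    batcher = MicroBatcher(fake_generate, max_batch_size=8, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Ten "concurrent" calls, as in the example above.
    replies = await asyncio.gather(
        *(batcher.submit(f"msg {i}") for i in range(10)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return list(replies), batcher.batch_sizes


if __name__ == "__main__":
    replies, sizes = asyncio.run(demo())
    print(sizes, replies[0])
```

With this, the ten calls would be served in a couple of batched passes instead of ten sequential ones, at the cost of up to `max_wait_s` of extra latency per request. Whether a batch of that size fits into 11 GB, and how much slower a batched pass is than a single one, would depend on the model and sequence lengths.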