Parallelise pipelines on a single GPU?

Hello everyone,
I am looking to create a REST API with Whisper and a GPU. My goal is to enable live speech-to-text. I managed to do this with a local server on my machine, and now I want multiple people to be able to use Whisper simultaneously. I am therefore looking to store my model in a pipeline and have multiple inferences processed simultaneously on the same GPU. Currently, I am using a queue and processing the inferences one after the other.
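Here is roughly what that serial setup looks like (a minimal sketch; the queue wiring is hypothetical, and whisperModel is the wrapper that appears in the snippets below):

import queue
import threading

import torch
from whisperModel import whisperModel

# One queue of pending requests; a single worker drains it, so
# inferences run strictly one after the other.
jobs = queue.Queue()

def worker(pipe):
    while True:
        audio_path = jobs.get()                 # block until a request arrives
        print(pipe.inference(audio_path, ""))   # serial inference
        jobs.task_done()

if __name__ == '__main__':
    pipe = whisperModel('openai/whisper-tiny', 'cuda', torch.float16)
    threading.Thread(target=worker, args=(pipe,), daemon=True).start()
    jobs.put("test.wav")
    jobs.join()  # wait until the queue is fully processed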
If I haven’t been clear enough or if you need more details, please don’t hesitate to ask.
Thank you in advance to everyone.


After running some tests: we can run parallel pipelines on a single GPU, but we need to initialize and store more than one model on the GPU. If two processes try to use the same model instance at the same time, the results are corrupted and the model produces nonsense instead of a transcription. The code I am using is:

from multiprocessing import Pool

import torch
from whisperModel import whisperModel

def compute(pipe):
    # Runs one inference inside a worker process.
    res = pipe.inference("test.wav", "")
    print(res)

if __name__ == '__main__':
    pipe = whisperModel('openai/whisper-tiny', 'cuda', torch.float16)
    print(pipe.inference("test.wav", ""))  # single-process inference works fine
    # Hand the *same* pipeline object to two worker processes at once.
    with Pool(2) as p:
        p.map(compute, [pipe, pipe])

The first result (single-process inference) is:
Alors ici la fois, j’aime mon travail, c’est comme ça, si la vie on aime tous notre travail, oui, dites-moi un as en plus, je vous remercie.
but both parallel results are just:
!!!
and
!!!
But if you initialize two different pipelines and run inference on each of them at the same time on the same GPU, it works.
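For reference, a minimal sketch of that two-pipeline approach (driving the two instances from threads is my assumption here; the key point is that each one gets its own pipeline object on the same GPU):

import threading

import torch
from whisperModel import whisperModel

def compute(pipe):
    # Each thread owns its own pipeline instance; the two instances
    # share the GPU but not weights or internal state.
    print(pipe.inference("test.wav", ""))

if __name__ == '__main__':
    pipes = [whisperModel('openai/whisper-tiny', 'cuda', torch.float16)
             for _ in range(2)]
    threads = [threading.Thread(target=compute, args=(p,)) for p in pipes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()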


For now, this is the only approach that works for me.

from whisperModel import whisperModel
import torch
import multiprocessing as mp

def inference(datas):
    # Each process loads its own copy of the model, so nothing is shared
    # between workers and the outputs stay clean.
    pipe = whisperModel('openai/whisper-large-v3', "cuda", torch.float32)
    res = pipe.inference(datas)
    print(res["text"])

if __name__ == '__main__':
    datas = "some amazing datas"
    procs = []
    for i in range(2):
        p = mp.Process(target=inference, args=(datas,))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()  # wait for both workers to finish

Python’s GIL restrictions are strict…
If the work is separable, it may be faster to split it into a standalone .py file and launch it with subprocess.run(). It comes down to whether the multiprocessing happens at the OS level, in the shell, or inside Python.
This method cannot be used for interdependent processes, though…
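A sketch of that idea, with a hypothetical standalone transcribe.py that loads its own model and transcribes one file (subprocess.run() blocks, so its non-blocking counterpart Popen is used here to get two OS processes going at once):

import subprocess
import sys

# Launch two fully independent OS processes; each runs its own Python
# interpreter with its own model, so the parent's GIL is never involved.
procs = [subprocess.Popen([sys.executable, "transcribe.py", "test.wav"])
         for _ in range(2)]
for p in procs:
    p.wait()  # wait for both transcriptions to finish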