Hello everyone,
I am looking to create a REST API with Whisper and a GPU. My goal is to enable live speech-to-text. I managed to do this with a local server on my machine, and now I want multiple people to be able to use Whisper at the same time. I am therefore looking to keep my model loaded in a pipeline and have it process multiple inferences concurrently on the same GPU. Currently, I am using a queue and processing the inferences one after the other.
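Roughly, my current setup looks like this (a simplified sketch, not my exact code; the queue, thread, and file names are illustrative, and whisperModel is my own wrapper, used again in the code further down):

    from whisperModel import whisperModel
    import queue
    import threading
    import torch

    jobs = queue.Queue()

    def worker():
        # One model, one thread: requests are served strictly one at a time.
        pipe = whisperModel('openai/whisper-tiny', 'cuda', torch.float16)
        while True:
            wav_path, result_box = jobs.get()
            result_box.put(pipe.inference(wav_path, ""))
            jobs.task_done()

    threading.Thread(target=worker, daemon=True).start()

    # Each API request enqueues its audio and waits for the transcription:
    result_box = queue.Queue(maxsize=1)
    jobs.put(("test.wav", result_box))
    print(result_box.get())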
If I haven’t been clear enough or if you need more details, please don’t hesitate to ask.
Thank you in advance to everyone.
After conducting some tests: we can run parallel pipelines on a single GPU, but we need to initialize and store more than one model on the GPU. If two processes try to use the same model at the same time, the results are not accurate and the model produces nonsense. The code I am using is:
    from whisperModel import whisperModel
    from multiprocessing import Pool
    import torch

    def compute(pipe):
        # Both workers receive the same pipeline object.
        res = pipe.inference("test.wav", "")
        print(res)

    if __name__ == '__main__':
        pipe = whisperModel('openai/whisper-tiny', 'cuda', torch.float16)
        print(pipe.inference("test.wav", ""))  # single-process baseline
        with Pool(2) as p:
            p.map(compute, [pipe, pipe])  # two processes share one model
The first result (the single-process baseline, a French transcription) is:
Alors ici la fois, j’aime mon travail, c’est comme ça, si la vie on aime tous notre travail, oui, dites-moi un as en plus, je vous remercie.
but the two parallel results are:
!!!
and
!!!
But if you initialize two different pipelines and run inference on each of them at the same time on the same GPU, it works (presumably because a CUDA-loaded model cannot be safely shared across forked processes, while separate per-process models can run side by side). For now, this is the only thing that works for me:
    from whisperModel import whisperModel
    import torch
    import multiprocessing as mp

    def inference(datas):
        # Each process builds its own pipeline, so each has its own model copy on the GPU.
        pipe = whisperModel('openai/whisper-large-v3', "cuda", torch.float32)
        res = pipe.inference(datas)
        print(res["text"])

    if __name__ == '__main__':
        datas = "some amazing datas"
        for i in range(2):
            p = mp.Process(target=inference, args=(datas,))
            p.start()
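A variant of the same idea, as a sketch rather than tested code: use a Pool whose initializer loads one pipeline per worker, so each model is loaded once instead of once per request. The 'spawn' start method is used because CUDA does not work reliably across fork().

    from whisperModel import whisperModel
    import torch
    import multiprocessing as mp

    pipe = None  # one pipeline per worker process

    def init_worker():
        global pipe
        # Each worker loads its own copy of the model (VRAM permitting).
        pipe = whisperModel('openai/whisper-large-v3', "cuda", torch.float32)

    def inference(datas):
        return pipe.inference(datas)["text"]

    if __name__ == '__main__':
        ctx = mp.get_context('spawn')  # safer than fork with CUDA
        with ctx.Pool(2, initializer=init_worker) as p:
            for text in p.map(inference, ["clip1.wav", "clip2.wav"]):
                print(text)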
Python’s GIL restrictions are crazy strict… If the work is separable, it may be faster to split it out into a standalone .py file and subprocess.run() it. It comes down to whether the multiprocessing is done by the OS, by the shell, or inside Python. This method cannot be used for interrelated processes, though…
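Something like this, for example (a rough sketch: transcribe.py is a hypothetical standalone inference script, and Popen is used instead of run() so the two launches do not block each other):

    import subprocess
    import sys

    # Launch one independent Python process per audio file.
    procs = [
        subprocess.Popen([sys.executable, "transcribe.py", wav])
        for wav in ["a.wav", "b.wav"]
    ]
    for p in procs:
        p.wait()  # wait for every transcription process to finish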