After running some tests: we can run parallel pipelines on a single GPU, but we need to initialize and store a separate model instance per process on the GPU. If we use the same model object from two different processes at the same time, the results are corrupted and no valid transcription comes back. The code I am using is:
```python
import torch
from multiprocessing import Pool

def compute(pipe):
    res = pipe.inference("test.wav", "")
    print(res)

if __name__ == '__main__':
    # whisperModel is my own wrapper around the Whisper pipeline
    pipe = whisperModel('openai/whisper-tiny', 'cuda', torch.float16)
    print(pipe.inference("test.wav", ""))  # single-process inference works
    with Pool(2) as p:
        # the same pipeline object is sent to both worker processes
        p.map(compute, [pipe, pipe])
```
The first result (the single-process call, a valid French transcription) is:

Alors ici la fois, j’aime mon travail, c’est comme ça, si la vie on aime tous notre travail, oui, dites-moi un as en plus, je vous remercie.

but both parallel results are only:

!!!
However, if you initialize two different pipelines and run inference on each of them at the same time on the same GPU, it works.
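For anyone hitting the same issue, here is a minimal sketch of that working setup. It assumes the same whisperModel wrapper class and test.wav file from the snippet above; each worker builds its own pipeline through a Pool initializer instead of receiving a pickled shared one, and the 'spawn' start method is used so an already-initialized CUDA context is never forked into the children.

```python
import torch
from multiprocessing import get_context

# Assumption: whisperModel is the same wrapper class used in the snippet
# above, defined at module level so the 'spawn' workers can re-import it.

_pipe = None  # each worker process gets its own pipeline

def init_worker():
    # Load a separate copy of the model in every worker, so no CUDA
    # state is shared between processes.
    global _pipe
    _pipe = whisperModel('openai/whisper-tiny', 'cuda', torch.float16)

def compute(path):
    return _pipe.inference(path, "")

if __name__ == '__main__':
    # 'spawn' avoids forking a process that already holds a CUDA context.
    ctx = get_context('spawn')
    with ctx.Pool(2, initializer=init_worker) as pool:
        for res in pool.map(compute, ["test.wav", "test.wav"]):
            print(res)
```

The trade-off is GPU memory: two copies of the model live on the card at once, which is fine for whisper-tiny but may not scale to larger checkpoints.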