The last time this question was asked was in 2020, and I can’t seem to figure out how to do this. Is it possible?
The bottleneck of generation is the model forward pass, so being able to run the forward pass across multiple GPUs should do it.
I’m not knowledgeable about multi-GPU inference, especially in PyTorch, maybe @sgugger knows how to do it
(the answer will also be useful to me)
Check out FasterTransformer, an NVIDIA library focused on serving large Transformer models across many GPUs and nodes in a distributed manner.
I think you can use the accelerate package (Hugging Face Accelerate) from pip. I believe it can also train on multiple machines, though I don’t know whether those need to be homogeneous. I use it to switch from CPU to MPS on my Mac, and it has options for multi-GPU.
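To make that concrete, here is a minimal sketch of sharding a model across all visible GPUs for generation, using the `device_map="auto"` option in `transformers` (which relies on Accelerate under the hood). This assumes `transformers`, `accelerate`, and `torch` are installed; the checkpoint name is just an example placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # example checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" asks Accelerate to split the model's layers across
# every visible GPU (spilling to CPU if needed), so a single forward
# pass, and therefore generate(), runs across all of them.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# model.device points at the device holding the first layers; inputs
# must start there, and Accelerate moves activations between shards.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note this splits the model itself (model parallelism), which is what you want when one GPU can’t hold the weights; it doesn’t speed up a model that already fits on one card.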