Multiple GPUs not properly parallelized during model.generate()


I am currently working with transformers version 4.15.0.

I’m using model.generate() with 4 beams for inference.

However, it seems that the generation process is not properly parallelized across the GPUs I have.

Is there a way to parallelize the generation process while using beam search?

Thank you

I may very well be wrong about this, but I don’t think that’s possible. Four beams means the four best candidate results from a single inference, not four separate inferences. And I don’t think HuggingFace is designed to support multiple GPUs for a single inference. You’d have to shuttle a bunch of data back and forth between GPUs to make that work, which would be really slow. While possible, I’d be surprised if the overhead didn’t make it slower than just doing it on one GPU.

Usually I’ve only seen multiple GPUs used for inference in a batch setting with lots of inferences to perform. Even then, each GPU gets its own copy of the model but they all do a single inference at a time, just like you are currently doing.

I would love to be wrong though.

Thank you for the answer. I think you have a point.
Well, then it seems that I have to split the dataset into multiple shards and run a separate process for each shard.

Thank you again!
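
For anyone landing here later, the shard-per-process idea above could look roughly like the sketch below. It is only an illustration under some assumptions: `shard` and `generate_shard` are hypothetical helper names, `t5-small` is a placeholder checkpoint, and each process is assumed to be started separately with its own shard index and GPU.

```python
def shard(items, num_shards, shard_index):
    """Return the items belonging to one shard (round-robin split)."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    # Items 0, num_shards, 2*num_shards, ... go to shard 0, and so on.
    return items[shard_index::num_shards]


def generate_shard(texts, device, model_name="t5-small", num_beams=4, batch_size=8):
    """Run beam-search generation for one shard on one device.

    model_name is a placeholder; swap in your own checkpoint.
    """
    # Imported lazily so shard() can be used without torch installed.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Each process loads its own full copy of the model onto its own GPU.
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device).eval()

    outputs = []
    with torch.no_grad():
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
            ids = model.generate(**enc, num_beams=num_beams)
            outputs.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return outputs
```

With 4 GPUs, process i (for i from 0 to 3) would call something like `generate_shard(shard(texts, 4, i), device=f"cuda:{i}")` and write its results to its own output file. Beam search still runs on a single GPU per input; the speedup comes only from processing different inputs in parallel.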

If you have a big dataset you need to do inference on (rather than just wanting a single generation to go faster), you may want to look into DeepSpeed. It works quite well with HuggingFace and now supports batch inference across multiple GPUs, not just training. Might save you a lot of trouble.

Hi @jdwx ,
Could you please share a script or point me to a link that would help with multi-GPU inference? I have trained a T5 model and want to load the pretrained model and run inference on 4 GPUs.

I have tried DeepSpeed but am running into errors with it. Could you please share something that can load a pretrained T5 model?