Multiple GPUs not properly parallelized during model.generate()

jspark93 · March 22, 2022, 7:47am

Hi,

I am currently using transformers version 4.15.0.

I’m using model.generate() with beam search (num_beams=4) for inference.

However, the generation process does not seem to be parallelized across the GPUs I have.

Is there a way to parallelize the generation process while using beam search?

Thank you

jdwx · March 22, 2022, 8:21pm

I may very well be wrong about this, but I don’t think that’s possible. Four beams are the four best candidate sequences kept during a single inference, not four separate inferences. And I don’t think Hugging Face is designed to split a single inference across multiple GPUs. You’d have to shuttle a lot of data back and forth between GPUs to make that work, which would be really slow. Even if it were possible, I’d be surprised if the overhead didn’t make it slower than just running on one GPU.

In my experience, multiple GPUs are usually used for inference only in a batch setting with lots of inputs to process. Even then, each GPU gets its own copy of the model, and each copy still runs one inference at a time, just like you are doing now.

I would love to be wrong though.

jspark93 · March 24, 2022, 6:33am

Thank you for the answer. I think you have a point.
Then it seems I’ll have to split the dataset into multiple shards and run a separate process for each shard.

Thank you again!

jdwx · March 24, 2022, 3:49pm

If you have a big dataset to run inference on (rather than just wanting a single generation to go faster), you may want to look into DeepSpeed. It works quite well with Hugging Face and now supports batch inference across multiple GPUs, not just training. It might save you a lot of trouble.

shivaays · October 9, 2022, 4:19am

Hi @jdwx,
Can you please share a script, or point me to a guide, for multi-GPU inference? I have trained a T5 model and want to load it and run inference on 4 GPUs.

I have tried DeepSpeed but ran into an error with it. Can you please share something that can load a pretrained T5 model?
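For reference, here is a minimal sketch of the approach jspark93 describes above (split the data into shards, one process per GPU, each process holding its own copy of the model) without DeepSpeed. It assumes a seq2seq checkpoint that loads with `AutoModelForSeq2SeqLM`, such as a fine-tuned T5, and uses Python's `multiprocessing` with the spawn start method. The helper names `shard_indices`, `run_shard` and `parallel_generate` are hypothetical, not part of transformers, and the defaults (batch size 8, `max_length=128`) are arbitrary choices for illustration. Beam search itself is not split across GPUs; each GPU just processes a different slice of the data.

```python
import multiprocessing as mp
import os


def shard_indices(n_items, n_shards):
    """Split range(n_items) into n_shards contiguous (start, end) slices.

    Shard sizes differ by at most one, and concatenating the shards in order
    reproduces the original order, so outputs can be merged back trivially.
    """
    base, extra = divmod(n_items, n_shards)
    bounds, start = [], 0
    for k in range(n_shards):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def run_shard(gpu_id, texts, model_name, num_beams, max_length, batch_size, out_queue):
    """Hypothetical worker: load a private model copy on one GPU and generate for its shard."""
    # Restrict this process to one GPU before CUDA is initialised,
    # so "cuda" below refers to physical GPU `gpu_id`.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Imported lazily so only the worker processes touch torch / CUDA.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda").eval()

    outputs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(
                texts[i : i + batch_size],
                return_tensors="pt",
                padding=True,
                truncation=True,
            ).to("cuda")
            # Beam search still runs entirely inside this process on this GPU.
            generated = model.generate(**batch, num_beams=num_beams, max_length=max_length)
            outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    out_queue.put((gpu_id, outputs))


def parallel_generate(texts, model_name, num_gpus, num_beams=4, max_length=128, batch_size=8):
    """Hypothetical data-parallel driver: one process per GPU, one contiguous shard each."""
    ctx = mp.get_context("spawn")  # fresh interpreters, safe to use with CUDA
    queue = ctx.Queue()
    procs = []
    for gpu_id, (start, end) in enumerate(shard_indices(len(texts), num_gpus)):
        if start == end:
            continue  # fewer inputs than GPUs: leave this GPU idle
        p = ctx.Process(
            target=run_shard,
            args=(gpu_id, texts[start:end], model_name, num_beams, max_length, batch_size, queue),
        )
        p.start()
        procs.append(p)
    # Collect results before join(): a child whose queued data was never
    # consumed may not exit, which would make join() hang.
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    # Shards are contiguous, so concatenating in GPU order restores input order.
    return [text for gpu_id in sorted(results) for text in results[gpu_id]]
```

Call it from inside an `if __name__ == "__main__":` block, e.g. `preds = parallel_generate(texts, "t5-base", num_gpus=4)` (or the path to your own fine-tuned checkpoint), because the spawn start method re-imports the main module in every worker. Contiguous shards keep merging trivial; if one worker crashes (for example, out of memory), `queue.get()` will block, so a more robust version would add a timeout.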
