When I run inference on multiple GPUs using multiple processes, model.generate() becomes very slow

I started multiple processes using subprocess, with each process getting a separate portion of the data and running inference (model.generate()) on a separate GPU. Strangely, when I do this, inference is much slower than with a single process, and GPU utilization is also very low. I timed the run and found that most of the time is spent in model.generate().
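
For reference, here is a minimal sketch of the setup I described above (the model name, worker script name, shard logic, and generation parameters are placeholders, not my exact code):

```python
# launch_workers.py -- sketch: one subprocess per GPU, each handling one data shard
import os
import subprocess
import sys

NUM_GPUS = 4  # assumption: number of available GPUs

procs = []
for rank in range(NUM_GPUS):
    # each worker sees only its own GPU and processes shard `rank`
    procs.append(subprocess.Popen(
        [sys.executable, "worker.py", "--shard", str(rank)],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(rank)},
    ))
for p in procs:
    p.wait()
```

Each worker then loads its own copy of the model and loops over its shard, roughly like this:

```python
# worker.py -- sketch: load the model on the visible GPU and run generate() on the shard
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--shard", type=int, required=True)
args = parser.parse_args()

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# placeholder shard data; in my real code each process reads its own slice of the dataset
prompts = [f"Example prompt {i} for shard {args.shard}" for i in range(10)]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128)  # this call is where most of the time goes
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```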