When I run inference on multiple GPUs using multiple processes, model.generate() becomes very slow

I started multiple processes using subprocess, with each process getting a separate portion of the data and running inference (model.generate()) on a separate GPU. Strangely, when I do this, inference is much slower than with a single process, and GPU utilization is also very low. I timed the run and found that most of the time is spent in model.generate().
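
For reference, here is a minimal sketch of the setup I described above (the model name, worker script name, shard logic, and generation parameters are placeholders, not my exact code):

```python
# launch_workers.py -- sketch: one subprocess per GPU, each handling one data shard
import os
import subprocess
import sys

NUM_GPUS = 4  # assumption: number of available GPUs

procs = []
for rank in range(NUM_GPUS):
    # each worker sees only its own GPU and processes shard `rank`
    procs.append(subprocess.Popen(
        [sys.executable, "worker.py", "--shard", str(rank)],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(rank)},
    ))
for p in procs:
    p.wait()
```

Each worker then loads its own copy of the model and loops over its shard, roughly like this:

```python
# worker.py -- sketch: load the model on the visible GPU and run generate() on the shard
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--shard", type=int, required=True)
args = parser.parse_args()

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# placeholder shard data; in my real code each process reads its own slice of the dataset
prompts = [f"Example prompt {i} for shard {args.shard}" for i in range(10)]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128)  # this call is where most of the time goes
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```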