More processes than GPUs with DeepSpeed launcher


I am training a LoRA adaptation of a T5 model in a single-machine, multi-GPU setup.
I am using Transformers 4.26.1 and DeepSpeed 0.9.2, and I launch my script with the `deepspeed` launcher (so the parallelization setup is Distributed Data Parallel).
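For reference, I launch roughly like this (script and config file names are placeholders, not my exact paths): the `deepspeed` launcher spawns one worker process per GPU on the node.

```shell
# Hypothetical launch command; train_lora_t5.py and ds_config.json
# stand in for my actual script and DeepSpeed config.
deepspeed --num_gpus=4 train_lora_t5.py --deepspeed ds_config.json
```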

Just before training starts, I first see four processes on my first GPU (probably the four ranks loading their model copies, right?).

When the training starts, I have four processes running on my first GPU and then one process on each of the other GPUs. Although I am using plain PyTorch (not Lightning), I see the same issue described in this discussion: extra process when running ddp across multiple GPUs · Lightning-AI/pytorch-lightning · Discussion #9864 · GitHub.
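For context, my understanding (which may be wrong) is that the launcher exports a `LOCAL_RANK` environment variable to each worker, and each process is supposed to bind to `cuda:{LOCAL_RANK}` before touching CUDA; a process that initializes CUDA on the default device first would create an extra context on GPU 0. A minimal stdlib-only sketch of that mapping (the helper `pick_device` is my own name, not from my script):

```python
import os

def pick_device(env=None):
    """Map a worker's LOCAL_RANK (set by the deepspeed launcher)
    to the CUDA device string that worker should bind to."""
    if env is None:
        env = os.environ
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# Simulating the four workers the launcher would spawn on a 4-GPU node:
for rank in range(4):
    print(pick_device({"LOCAL_RANK": str(rank)}))
# Each worker should end up on its own device, cuda:0 through cuda:3;
# in a real script you would pass this to torch.cuda.set_device(...).
```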

See the attached image.

Is this normal? I have only seen examples with one process per GPU, so I would be very interested in an explanation.

Thanks!