I am training a LoRA adaptation of a T5 model on a single machine with multiple GPUs.
I am using Transformers 4.26.1 and DeepSpeed 0.9.2, and I launch my script with the `deepspeed` launcher (so the parallelization setup is Distributed Data Parallel).
Just before training starts, I see four processes on my first GPU (probably the four model copies being loaded, right?).
Once training is underway, I have four processes running on my first GPU and one process on each of the other GPUs. Although I am using regular PyTorch rather than Lightning, I have the same issue described in this thread: extra process when running ddp across multiple GPUs · Lightning-AI/pytorch-lightning · Discussion #9864 · GitHub.
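For what it's worth, one common cause of this pattern in plain PyTorch DDP setups is that a rank touches the default device (`cuda:0`) before binding itself to its own GPU, which creates a spare CUDA context on the first GPU for every process. A minimal sketch of the usual precaution, assuming the launcher sets the `LOCAL_RANK` environment variable (the `deepspeed` launcher does), would look like this; `pin_to_local_gpu` is just an illustrative helper name:

```python
import os


def pin_to_local_gpu() -> int:
    """Bind this process to its local GPU and return the local rank.

    Calling torch.cuda.set_device(local_rank) before any other CUDA
    operation prevents each rank from initializing a context on cuda:0,
    which is what shows up as extra processes on the first GPU.
    """
    # The deepspeed (and torchrun) launchers export LOCAL_RANK per process.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)
    except ImportError:
        pass  # torch not installed here; nothing to pin (sketch only)
    return local_rank


local_rank = pin_to_local_gpu()
print(f"process bound to GPU {local_rank}")
```

This is only a sketch of the general DDP device-pinning idiom, not a claim about what my script currently does wrong.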
Is this normal? I have only seen examples with one process per GPU, so I would be very interested in an explanation.