Model's evaluation in DDP training is using only one GPU

Hello! :smiley:
I am training an HF model with torch DDP using the following command line:

python -m torch.distributed.launch --nproc_per_node 2 --{arguments}

I noticed that while training used both available GPUs, the evaluation step ran on only a single GPU. After checking the source code, it seems that the model is not wrapped in DDP when training==False.

Is it expected that only one GPU is used during the evaluation step? If so, could you explain why DDP cannot be used for evaluation as well?


Have you figured out how to use multiple GPUs for the eval loop during training? I am facing the same issue.
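
One note that may help frame the question: the DDP wrapper mainly exists to synchronize gradients during the backward pass, so a forward-only evaluation does not strictly need it. A common alternative is to let every process evaluate its own shard of the eval set and then sum the per-process counts. Below is a minimal sketch of that idea, simulated in plain Python in a single process with no GPUs. The names shard_indices and evaluate_shard, the toy predict function, and the data are all hypothetical and are not taken from the Trainer source. Whether your version of the library already does something like this is worth checking in the code.

```python
# Sketch only: simulates per-rank evaluation in one process, no GPUs needed.
# shard_indices, evaluate_shard and the toy model are hypothetical names.

def shard_indices(num_samples, rank, world_size):
    """Indices of the evaluation samples assigned to one rank (strided split)."""
    return list(range(rank, num_samples, world_size))


def evaluate_shard(predict, inputs, labels, indices):
    """Return (correct, seen) for one rank's shard; no gradients are involved."""
    correct = sum(1 for i in indices if predict(inputs[i]) == labels[i])
    return correct, len(indices)


def predict(x):
    # Toy stand-in for a model's forward pass: classify by parity.
    return x % 2


inputs = list(range(10))
labels = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1]  # sample 8 is deliberately mislabeled
world_size = 2  # matches --nproc_per_node 2

# Each rank would run this on its own GPU; here the ranks are looped over.
per_rank = [
    evaluate_shard(predict, inputs, labels, shard_indices(len(inputs), r, world_size))
    for r in range(world_size)
]

# In a real job this sum would be a collective such as torch.distributed.all_reduce.
total_correct = sum(c for c, _ in per_rank)
total_seen = sum(n for _, n in per_rank)
accuracy = total_correct / total_seen
```

In an actual multi-GPU run, each process would compute only its own (correct, seen) pair on its own device, and the two sums at the end would be replaced by a collective reduction across processes.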