Model evaluation in DDP training uses only one GPU

Hello! :smiley:
I am training an HF model with torch DDP using the following command line:

python -m torch.distributed.launch --nproc_per_node 2 my_script.py --{arguments}
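
(Side note: on recent PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun; assuming the script reads LOCAL_RANK from the environment rather than a --local_rank argument, the equivalent launch would be:

torchrun --nproc_per_node 2 my_script.py --{arguments})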

I noticed that while training was using the two available GPUs, the evaluation step ran on only a single GPU. After checking the source code, it seems that here the model is not wrapped in DDP when training == False.

Is it expected that only one GPU will be used during the evaluation step? If so, could you explain why DDP cannot be used for evaluation as well?
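
For context, my understanding is that the DDP wrapper only matters for gradient synchronization during backward, so in principle evaluation could be parallelized just by sharding the data across ranks and reducing the metrics at the end. Here is a minimal sketch of what I had in mind. This is not the Trainer's actual code: the batch keys ("input", "label") are illustrative, and it assumes the process group has already been initialized by the launcher.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def distributed_evaluate(model, eval_dataset, device, batch_size=32):
    # Each rank sees a different shard of the eval set. Note that
    # DistributedSampler pads the last shard with duplicated samples so
    # all ranks get the same number of batches, so exact sample counts
    # need extra care on small datasets.
    sampler = DistributedSampler(eval_dataset, shuffle=False)
    loader = DataLoader(eval_dataset, batch_size=batch_size, sampler=sampler)

    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    with torch.no_grad():
        for batch in loader:
            # "input"/"label" are placeholder keys for whatever the
            # collator actually produces.
            inputs = batch["input"].to(device)
            labels = batch["label"].to(device)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum()
            total += labels.numel()

    # Sum the per-rank counts so every rank ends up with the global accuracy.
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return (correct / total).item()
```

Something like this runs the forward passes on all GPUs without wrapping the model in DDP at all, which is why I am unsure what the single-GPU evaluation path buys us.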


Have you figured out how to use multiple GPUs for the eval loop during training? I am facing the same issue.
