Model evaluation in DDP training uses only one GPU

Hello! :smiley:
I am training an HF model with torch DDP using the following command line:

python -m torch.distributed.launch --nproc_per_node 2 my_script.py --{arguments}
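
(Side note: on recent PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun; assuming the script reads LOCAL_RANK from the environment rather than a --local_rank argument, the equivalent launch would be:

torchrun --nproc_per_node 2 my_script.py --{arguments})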

I noticed that while training was using the two available GPUs, the evaluation step ran on only a single GPU. After checking the source code, it seems that here the model is not wrapped in DDP when training == False.

Is it expected that only one GPU will be used during the evaluation step? If so, could you explain why DDP cannot be used for evaluation as well?
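
For context, my understanding is that the DDP wrapper only matters for gradient synchronization during backward, so in principle evaluation could be parallelized just by sharding the data across ranks and reducing the metrics at the end. Here is a minimal sketch of what I had in mind. This is not the Trainer's actual code: the batch keys ("input", "label") are illustrative, and it assumes the process group has already been initialized by the launcher.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def distributed_evaluate(model, eval_dataset, device, batch_size=32):
    # Each rank sees a different shard of the eval set. Note that
    # DistributedSampler pads the last shard with duplicated samples so
    # all ranks get the same number of batches, so exact sample counts
    # need extra care on small datasets.
    sampler = DistributedSampler(eval_dataset, shuffle=False)
    loader = DataLoader(eval_dataset, batch_size=batch_size, sampler=sampler)

    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    with torch.no_grad():
        for batch in loader:
            # "input"/"label" are placeholder keys for whatever the
            # collator actually produces.
            inputs = batch["input"].to(device)
            labels = batch["label"].to(device)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum()
            total += labels.numel()

    # Sum the per-rank counts so every rank ends up with the global accuracy.
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return (correct / total).item()
```

Something like this runs the forward passes on all GPUs without wrapping the model in DDP at all, which is why I am unsure what the single-GPU evaluation path buys us.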


Have you figured out how to use multiple GPUs for the eval loop during training? I am facing the same issue.
