Why is it that when I use Trainer, multiple GPUs are used for training but only one GPU is used for evaluation? When I compared GPU usage during training and evaluation, I found that during evaluation only the memory of GPU-0 increases and only its GPU utilization is non-zero.
As a result, per_device_eval_batch_size can only be 1 before I hit OOM, and evaluation is slow.
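Roughly, my setup looks like the sketch below (the checkpoint, dataset, and batch sizes are placeholders, not my actual script):

```python
# Minimal sketch of the setup; model/dataset names are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

raw = load_dataset("glue", "sst2")  # placeholder dataset

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

encoded = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,   # anything larger goes OOM during eval
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)

trainer.train()     # training uses all visible GPUs here
trainer.evaluate()  # evaluation only seems to use GPU-0
```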
Have the same exact issue. Have you come to any conclusions / managed to proceed?
Not yet
Any update on this? I am getting OOM for the same thing, only cuda:0 is being used…
I have heard about DataParallel and DistributedDataParallel, but they appear to require fairly extensive refactoring of my training script…
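For context, my understanding is that the plain PyTorch DataParallel approach looks roughly like the sketch below (the toy model and inputs are made up for illustration); the DistributedDataParallel variant additionally needs a multi-process launcher such as torchrun, which is the kind of refactoring I was hoping to avoid:

```python
import torch
import torch.nn as nn

# Toy stand-ins for my real model and eval data (made up for illustration).
model = nn.Linear(128, 2)
eval_inputs = torch.randn(64, 128)

if torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the model on every visible GPU and
    # splits each input batch across the replicas.
    model = nn.DataParallel(model)
model = model.to("cuda")

model.eval()
with torch.no_grad():
    # The batch is sliced across GPUs inside the forward pass, which is
    # the behavior I would hope to get during evaluation.
    outputs = model(eval_inputs.to("cuda"))
print(outputs.shape)
```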