Why is it that when I use Trainer, multiple GPUs are used for training but only one GPU is used for evaluation? When I compared GPU usage during training and evaluation, I found that during evaluation only the memory of GPU-0 increases and only its GPU utilization is non-zero.
As a result, per_device_eval_batch_size can only be 1 before I hit OOM, and evaluation is slow.
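Roughly, my setup looks like the sketch below (the checkpoint, dataset, and batch sizes are placeholders, not my actual script):

```python
# Minimal sketch of the setup; model/dataset names are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

raw = load_dataset("glue", "sst2")  # placeholder dataset

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

encoded = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,   # anything larger goes OOM during eval
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)

trainer.train()     # training uses all visible GPUs here
trainer.evaluate()  # evaluation only seems to use GPU-0
```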
Have the same exact issue. Have you come to any conclusions / managed to proceed?
Not yet
Any update on this? I am getting OOM for the same thing, only cuda:0 is being used…
I have heard about DataParallel and DistributedDataParallel, but they appear to require fairly extensive refactoring of my training script…
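For context, my understanding is that the plain PyTorch DataParallel approach looks roughly like the sketch below (the toy model and inputs are made up for illustration); the DistributedDataParallel variant additionally needs a multi-process launcher such as torchrun, which is the kind of refactoring I was hoping to avoid:

```python
import torch
import torch.nn as nn

# Toy stand-ins for my real model and eval data (made up for illustration).
model = nn.Linear(128, 2)
eval_inputs = torch.randn(64, 128)

if torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the model on every visible GPU and
    # splits each input batch across the replicas.
    model = nn.DataParallel(model)
model = model.to("cuda")

model.eval()
with torch.no_grad():
    # The batch is sliced across GPUs inside the forward pass, which is
    # the behavior I would hope to get during evaluation.
    outputs = model(eval_inputs.to("cuda"))
print(outputs.shape)
```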