Failed to increase batch size when using multiple GPUs

Hi,

I have a machine with 8 Tesla V100 GPUs.
When I train a model with the ORTTrainer API from optimum.onnxruntime (this also happens with the Trainer API from transformers) on a single GPU (`os.environ["CUDA_VISIBLE_DEVICES"] = "0"`), I can use a batch size of 32. However, when I run training on all 8 GPUs, I get an OOM error with bs=32. I was expecting to be able to increase the batch size a lot with 8 GPUs.
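For reference, here is a minimal sketch of my setup (the model name, dummy dataset, and output path below are just placeholders for my actual model and data):

```python
import os
# Single-GPU run: restricting visibility to GPU 0 lets bs=32 fit.
# Removing this line to train on all 8 GPUs is what triggers the OOM.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

# Placeholder model and tiny dummy dataset (my real ones are larger).
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw = Dataset.from_dict({"text": ["good", "bad"] * 64, "label": [1, 0] * 64})
train_dataset = raw.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=32)
)

args = ORTTrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=32,  # fine on 1 GPU, OOMs on 8
    num_train_epochs=3,
)

trainer = ORTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```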
Does the ORTTrainer API not handle parallelization well?
I noticed that one of the GPUs has ~99% of its memory used, while the others are lower, around 70%.
Do you have any suggestions for me?
Thanks