Failed to increase batch size when using multiple GPUs


I have a machine with 8 Tesla V100 GPUs.
When I train a model with the ORTTrainer API from optimum.onnxruntime (this also happens with the Trainer API from transformers) on a single GPU (os.environ["CUDA_VISIBLE_DEVICES"] = "0"), I can use a batch size of 32. However, when I run training on all GPUs, I get an OOM error with bs=32. I was expecting to be able to increase the batch size significantly with 8 GPUs.
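For context, my expectation was based on the usual data-parallel batch-size arithmetic: in Trainer-style APIs, `per_device_train_batch_size` is per GPU, so the effective (global) batch size scales with the number of visible devices. A minimal sketch of that arithmetic (the numbers are just my setup, 8 GPUs and bs=32):

```python
def effective_batch_size(per_device_bs: int, num_gpus: int,
                         grad_accum_steps: int = 1) -> int:
    """Global batch size processed per optimizer step under data parallelism."""
    return per_device_bs * num_gpus * grad_accum_steps

# One V100 handles a per-device batch of 32 fine:
single_gpu = effective_batch_size(32, 1)   # 32 samples per step

# With 8 GPUs and the same per-device setting, each step
# processes 8x as many samples in total:
multi_gpu = effective_batch_size(32, 8)    # 256 samples per step

print(single_gpu, multi_gpu)
```

So I assumed each GPU would still only hold 32 samples' worth of activations, and memory use per device would match the single-GPU run.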
Does the ORTTrainer API not handle parallelization well?
I noticed that one of the GPUs has ~99% of its memory used while the others use less, around 70%.
Do you have any suggestions for me?