Hello,
I’m trying to train on 4 GPUs with this command:

```
python -m torch.distributed.launch --nproc_per_node=4 run_translation.py --other_args
```
With that command and per_device_train_batch_size=3 I get a CUDA out-of-memory error, but with per_device_train_batch_size=2 I don’t.
However, if I train on a single GPU with per_device_train_batch_size=3, there is no memory error.
I’m using such a small batch size because each GPU has 15 GB of memory and my inputs are 1024 tokens long.
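The only workaround I can think of is keeping the per-device batch smaller and making up the difference with gradient accumulation; a sketch, assuming run_translation.py exposes the standard Trainer arguments (I haven’t confirmed this avoids the issue):

```
# Effective per-device batch of 3 via 3 accumulation steps of batch 1
python -m torch.distributed.launch --nproc_per_node=4 run_translation.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 3 \
    --other_args
```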
I also tried these lines of code, but nothing changed:

```python
import torch

torch.cuda.set_device(rank)  # rank: local rank of this process (e.g. from --local_rank)
torch.cuda.empty_cache()     # release cached blocks back to the driver
```
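If it helps to diagnose, here is a minimal sketch of my own (not from run_translation.py) for logging what each process actually allocates, using torch.cuda’s built-in counters:

```python
import torch
import torch.distributed as dist

def log_memory(tag: str) -> None:
    # memory_allocated counts live tensors; memory_reserved counts
    # blocks held by PyTorch's caching allocator on the current device.
    rank = dist.get_rank() if dist.is_initialized() else 0
    alloc_mb = torch.cuda.memory_allocated() / 1024**2
    reserved_mb = torch.cuda.memory_reserved() / 1024**2
    print(f"[rank {rank}] {tag}: allocated={alloc_mb:.0f} MB, reserved={reserved_mb:.0f} MB")
```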
Does distributed training use more GPU memory per device? I’m only using the command above; is there anything else I should set?
Thanks