Multi-GPU Training using Accelerate: RAM Issue Leading to Failure

I am currently using Accelerate for multi-GPU training. Running python train.py on a single GPU works fine. However, when I launch with the following command, RAM usage keeps increasing until the process eventually fails:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --num_processes 2 train.py
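
For context, here is a minimal sketch of the kind of Accelerate training loop involved; the model, dataset, and hyperparameters below are placeholders rather than the actual script:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model and synthetic data, just to illustrate the structure.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# prepare() moves everything to the right device and shards the dataloader per process.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(3):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()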

Execution Environment:
accelerate : 0.28.0
python : 3.8.10
cuda : 12.1 (nvcc -V), 12.0 (nvidia-smi)
pytorch : 2.1.0+cu121

Everything is running inside a Docker container.
