Multi-GPU Training using Accelerate: RAM Issue Leading to Failure

I am currently using Accelerate for multi-GPU training. Running python train.py on a single GPU works fine. However, when I launch with the following command, RAM usage keeps increasing until the process eventually fails:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --num_processes 2 train.py
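
For context, here is a minimal sketch of the kind of Accelerate training loop involved; the model, dataset, and hyperparameters below are placeholders rather than the actual script:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model and synthetic data, just to illustrate the structure.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# prepare() moves everything to the right device and shards the dataloader per process.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(3):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()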

Execution Environment:
accelerate : 0.28.0
python : 3.8.10
cuda : 12.1 (nvcc -V), 12.0 (nvidia-smi)
pytorch : 2.1.0+cu121

Everything is running inside a Docker container.
