Hi, I have an instance with 8x A100 GPUs and 1.1 TB of RAM. However, accelerate launch can't run my script on all 8 GPUs; it only works with up to 6 processes:
accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=6 \
scripts/torch_convnext.py \
--model_name='convnext_large' --batch_size=64 --epochs=10 \
--lr=6e-5 --pretrained='imagenet' --optimize='AdamW'
If I set num_processes > 6, the subprocesses die with a 'Killed' error, which suggests the OOM killer fired because all system RAM was used up.
Is there any way I can utilize all 8 of my GPUs without overflowing RAM?
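For reference, here is the minimal snippet I've been using inside the training script to see how much host RAM each process peaks at (stdlib only; assumes Linux, where ru_maxrss is reported in KiB):

```python
import os
import resource

# Peak resident set size of this process so far.
# On Linux, ru_maxrss is in KiB; on macOS it is in bytes.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"[pid {os.getpid()}] peak RSS: {peak_kib / 1024:.1f} MiB")
```

Each launched process prints its own line, so multiplying the per-process peak by num_processes gives a rough estimate of total host RAM needed.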