Hi, I have an instance with 8x A100 GPUs and 1.1 TB of RAM. However, accelerate launch can't run my script on all 8 GPUs; it only works with up to 6 processes:
accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=6 \
scripts/torch_convnext.py \
--model_name='convnext_large' --batch_size=64 --epochs=10 \
--lr=6e-5 --pretrained='imagenet' --optimize='AdamW'
If I set num_processes > 6, the subprocesses die with a 'Killed' error, which suggests the OOM killer fired because all system RAM was used up.
Is there any way I can utilize all 8 of my GPUs without overflowing RAM?
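For reference, here is the minimal snippet I've been using inside the training script to see how much host RAM each process peaks at (stdlib only; assumes Linux, where ru_maxrss is reported in KiB):

```python
import os
import resource

# Peak resident set size of this process so far.
# On Linux, ru_maxrss is in KiB; on macOS it is in bytes.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"[pid {os.getpid()}] peak RSS: {peak_kib / 1024:.1f} MiB")
```

Each launched process prints its own line, so multiplying the per-process peak by num_processes gives a rough estimate of total host RAM needed.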