ZeRO uses more RAM than DDP?

Kenkentron · August 7, 2023, 2:46pm

Hi community,

I am interested in how much ZeRO-3 reduces memory usage compared to running the same training script 1) on one GPU and 2) with PyTorch DDP (example: From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease).

I am using 2 x 2080 Ti and observe the following:

If I run python peft_lora_seq2seq_accelerate_ds_zero3_offload.py, my first GPU uses roughly 2500-2900MB of RAM;
If I run torchrun --nproc-per-node=2 peft_lora_seq2seq_accelerate_ds_zero3_offload.py, both of my GPUs use 2500-2900MB of RAM, which makes sense as DDP just replicates the entire model on both cards;
If I run accelerate launch --config_file ds_zero3_cpu.yaml peft_lora_seq2seq_accelerate_ds_zero3_offload.py, both of my GPUs constantly use more than 3000MB, sometimes as much as 5000MB.
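
For anyone who wants to reproduce readings like the ones above, here is a minimal sketch of how per-GPU usage could be logged. It assumes the numbers come from nvidia-smi and that nvidia-smi is on the PATH; parse_memory_used and read_gpu_memory are hypothetical helper names, not part of the training script.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: "<index>, <memory used in MiB>".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Map GPU index -> used memory (MiB) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def read_gpu_memory():
    """Run nvidia-smi once and return the current per-GPU usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)
```

Calling read_gpu_memory() periodically from a second terminal while each of the three commands runs gives one dictionary per sample, which makes the single-GPU, DDP, and ZeRO-3 runs easier to compare side by side.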

I feel like I must have missed something basic here, but I am not quite sure where to look for the problem, so any pointers are appreciated!

Thanks,
