ZeRO uses more RAM than DDP?

Hi community,

I am experimenting with this example (DeepSpeed), the full code is here, and I have a question regarding memory usage.

Specifically, I was interested in how much ZeRO-3 reduces memory usage compared to running the same training script 1) on a single GPU and 2) with PyTorch DDP (example: From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease).

I am using 2 × 2080 Ti GPUs and observe the following:

  1. If I run the script with plain `python` (single GPU), my first GPU uses roughly 2500-2900 MB;
  2. If I run `torchrun --nproc-per-node=2`, both of my GPUs use 2500-2900 MB, which makes sense since DDP replicates the entire model on each card;
  3. If I run `accelerate launch --config_file ds_zero3_cpu.yaml`, both GPUs stay above 3000 MB, sometimes reaching 5000 MB.

I feel like I must have missed something basic here, but I am not quite sure where to look for the problem, so any pointers are appreciated!