ZeRO-3 uses more GPU memory than DDP?

Hi community,

I am experimenting with this DeepSpeed example (full code here: https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py) and have a question about its memory usage.

Specifically, I was interested in how much ZeRO-3 reduces memory usage compared to running the same training script 1) on a single GPU and 2) with PyTorch DDP (as in the blog post "From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease").

I am using 2 x 2080 Ti GPUs and observe the following:

  1. If I run python peft_lora_seq2seq_accelerate_ds_zero3_offload.py, my first GPU uses roughly 2500-2900MB of GPU memory;
  2. If I run torchrun --nproc-per-node=2 peft_lora_seq2seq_accelerate_ds_zero3_offload.py, both of my GPUs use 2500-2900MB, which makes sense as DDP simply replicates the entire model on both cards;
  3. If I run accelerate launch --config_file ds_zero3_cpu.yaml peft_lora_seq2seq_accelerate_ds_zero3_offload.py (a sketch of that config is below), both GPUs stay above 3000MB and sometimes reach 5000MB.
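
For reference, ds_zero3_cpu.yaml is an Accelerate config for DeepSpeed ZeRO-3 with CPU offload of parameters and optimizer states. The sketch below shows the kind of config I mean; the exact values (num_processes, gradient accumulation, etc.) are illustrative and not necessarily identical to the file shipped with the PEFT example.

```yaml
# Illustrative ZeRO-3 + CPU offload config for `accelerate launch`.
# A sketch of the shape of ds_zero3_cpu.yaml, not the exact file from the PEFT repo.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                   # ZeRO stage 3: shard params, grads, optimizer states
  offload_optimizer_device: cpu   # keep optimizer states in CPU RAM
  offload_param_device: cpu       # keep parameters in CPU RAM
  zero3_init_flag: true
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2                  # one process per 2080 Ti
use_cpu: false
```

Since parameters and optimizer states should be sharded across the two GPUs and offloaded to CPU, I would have expected per-GPU usage to drop relative to cases 1 and 2, not go up.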

I feel like I must be missing something basic here, but I am not quite sure where to look for the problem, so any pointers would be appreciated!

Thanks,