ZeRO uses more RAM than DDP?

Hi community,

I am experimenting with this example (DeepSpeed), the full code is here, and I have a question regarding memory usage.

Specifically, I was interested in how much ZeRO-3 reduces memory usage compared to running the same training script 1) on a single GPU and 2) with PyTorch DDP (example: From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease).

I am using 2 × 2080 Ti GPUs and observe the following:

  1. If I run the script with plain `python` (single GPU), my first GPU uses roughly 2500-2900 MB;
  2. If I run `torchrun --nproc-per-node=2`, both of my GPUs use 2500-2900 MB, which makes sense since DDP replicates the entire model on each card;
  3. If I run `accelerate launch --config_file ds_zero3_cpu.yaml`, both GPUs stay above 3000 MB, sometimes reaching 5000 MB.

I feel like I must have missed something basic here, but I am not quite sure where to look for the problem, so any pointers are appreciated!