I found a similar topic on this forum ("Saving checkpoint is too slow with deepspeed"), but it didn't have any helpful answers.
I'm currently using DeepSpeed with a custom HuggingFace Trainer. Saving a checkpoint takes around 10-12 minutes without DeepSpeed. With DeepSpeed it takes more than 25 minutes (I had to force-quit the process, so the full save may take even longer).
Is this normal behavior? I’m currently using ZeRO-0.
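
In case it helps anyone reproduce the comparison, the save times above could be measured with something like the sketch below. This is only a minimal illustration, not my actual training code: `timed` and `fake_save` are hypothetical names, and `fake_save` stands in for whatever call writes the checkpoint (for example `trainer.save_model(...)` on the Trainer).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def fake_save():
    # Hypothetical stand-in for the real checkpoint write
    # (assumption: in a real run this would be the Trainer's save call).
    time.sleep(0.05)


timings = {}
with timed("checkpoint_save", timings):
    fake_save()

print(f"checkpoint_save took {timings['checkpoint_save']:.2f}s")
```

Running the same wrapper once with DeepSpeed enabled and once without should show whether the extra time is spent inside the save call itself or somewhere around it.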