Saving checkpoints when using DeepSpeed is taking abnormally long

I found a similar topic on this forum ("Saving checkpoint is too slow with deepspeed"), but it didn’t have any helpful answers.

I’m currently using DeepSpeed with a custom HuggingFace Trainer. Without DeepSpeed, saving a checkpoint takes around 10-12 minutes. With DeepSpeed, it takes more than 25 minutes; I had to force-quit my process, so the actual time may be even longer.

Is this normal behavior? I’m currently using ZeRO-0.
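
For context, here is a minimal sketch of what a ZeRO-0 setup looks like. It is an illustrative assumption, not my exact config: the file name `ds_config.json` and the "auto" values are placeholders, and only `zero_optimization.stage` being 0 reflects the setup described above.

```python
import json

# Illustrative DeepSpeed config (assumed, not the exact one from my run).
# Stage 0 disables ZeRO partitioning of optimizer states, gradients, and parameters.
ds_config = {
    "zero_optimization": {"stage": 0},
    # "auto" lets the HuggingFace Trainer integration fill these in.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Hypothetical file name; the path would be passed as TrainingArguments(deepspeed=...).
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting file is what gets handed to the Trainer through the `deepspeed` argument of `TrainingArguments`.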
