DeepSpeed ZeRO causes intermittent GPU usage

I am not sure how to explain it, but I am finally getting much better results with DeepSpeed.

The recipe that worked for me:

- updating the config to the one I linked didn't bring any improvement on its own
- adding fused AdamW brought some improvement
- adding gradient checkpointing made the overall training fantastic: no more intermittent GPU usage, greatly reduced memory consumption, and fast training given the context length
- this combo brought a significant improvement over my previous best setup, which was fused AdamW and gradient checkpointing alone
- doing the same with ZeRO-2 did not yield the same improvement

Rough sketches of each piece are below, in case they help.
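
Fused AdamW in plain PyTorch is just the `fused` flag on `torch.optim.AdamW` (CUDA tensors only); the model and hyperparameters here are placeholders, not my actual values:

```python
import torch

# Toy model just so the snippet runs; fused AdamW requires CUDA tensors.
model = torch.nn.Linear(512, 512).cuda()

# fused=True selects the single-kernel CUDA implementation of AdamW.
# lr and weight_decay are placeholders, not my actual values.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
    fused=True,
)
```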
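
Gradient checkpointing on a Hugging Face model is a one-liner; again, the checkpoint name is just a placeholder:

```python
from transformers import AutoModelForCausalLM

# Placeholder model; substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Recompute activations during the backward pass instead of storing them;
# this is what freed up the memory for my long-context runs.
model.gradient_checkpointing_enable()
```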
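
And here is a rough sketch of how the whole combination can be wired up, assuming the Hugging Face Trainer is driving DeepSpeed. The ZeRO section is a generic stage-3 config for illustration, not the exact config I linked, and all the numbers are placeholders:

```python
from transformers import TrainingArguments

# Generic ZeRO-3 sketch, not the exact config linked above. "auto" values
# are filled in from TrainingArguments by the HF DeepSpeed integration.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Placeholder hyperparameters. Note there is no "optimizer" section in
# ds_config, so the Trainer's optim choice (fused AdamW) should be used.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    optim="adamw_torch_fused",      # fused AdamW
    gradient_checkpointing=True,    # trade recompute for memory
    deepspeed=ds_config,            # ZeRO config from above
)
```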

I am not sure how transferable this is to other models, but through trial and error I finally managed to get something good for my training… good luck!
