I am not sure how to explain it, but I am finally getting much better results with DeepSpeed.
The recipe that worked for me:
- use the example configs provided in `tests/deepspeed` in the huggingface/transformers repository on GitHub (a sketch of what they look like is below)
- use fused AdamW (I am using the PyTorch implementation)
- use gradient checkpointing together with DeepSpeed ZeRO-3
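For reference, here is a minimal ZeRO-3 sketch in the spirit of those example configs, not my exact file, so adapt it to your setup. The `"auto"` values are filled in by the HF Trainer from `TrainingArguments`:

```python
# Minimal ZeRO-3 config sketch, modelled on the examples in tests/deepspeed.
# "auto" values are resolved by the HF Trainer from TrainingArguments.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```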
In more detail:
- updating the config to the one linked above did not bring any improvement on its own
- adding fused AdamW brought some improvement
- adding gradient checkpointing made the overall training fantastic: no more intermittent GPU usage, greatly reduced memory consumption, and fast training given the context length
- this combo brought a significant improvement over my best previous setup, which was only fused AdamW and gradient checkpointing (Trainer-side settings sketched below)
- doing the same with ZeRO-2 did not yield the same improvement
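For concreteness, the Trainer side of that combo looks roughly like this; the model name and dataset are placeholders, so treat it as a sketch of the relevant arguments rather than my exact script. Switching between ZeRO-3 and ZeRO-2 is just the `"stage"` field in the config above (ZeRO-2 does not use the `stage3_*` keys).

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder checkpoint; substitute your own model.
model = AutoModelForCausalLM.from_pretrained("your-model-checkpoint")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    optim="adamw_torch_fused",     # fused PyTorch AdamW
    gradient_checkpointing=True,   # big memory saving for a bit more compute
    deepspeed=ds_config,           # the ZeRO-3 dict above, or a path to a JSON file
)

# train_dataset is a placeholder for your tokenized dataset.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```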
I am not sure how transferable this is to other models, but through trial and error I finally got something that works well for my training… good luck!