The same hyperparameters give worse results with DeepSpeed than without DeepSpeed

I'm training a model (DPCNN) with a batch size of 512 without DeepSpeed. With DeepSpeed, the actual batch size is 512/8 = 64 (I have 8 GPUs), but the loss and accuracy with DeepSpeed are far worse than without it. The two experiments use the same code and the same hyperparameters, except for the batch size. Does anyone have an idea what could explain this?

Hello,

We need a small script that we can run to reproduce the behaviour. In our experiments, DeepSpeed works as expected. For reference, please go through the blog post "Accelerate Large Model Training using DeepSpeed" (huggingface.co), where the GLUE task shows on-par results with and without DeepSpeed.
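
When preparing such a script, one quick sanity check is whether both runs actually see the same effective batch size. In DeepSpeed's config, `train_batch_size` equals `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the number of GPUs. The snippet below is only a sketch of that arithmetic using the numbers from the question; the helper name `effective_batch_size` is made up for illustration, and a gradient accumulation of 1 is an assumption, since the question does not state it.

```python
def effective_batch_size(micro_batch_per_gpu, num_gpus, grad_accum_steps=1):
    """Global batch size seen by one optimizer step across all GPUs."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# Numbers taken from the question: 512 without DeepSpeed,
# 512 / 8 = 64 per GPU on 8 GPUs with DeepSpeed.
baseline_batch_size = 512
deepspeed_batch_size = effective_batch_size(micro_batch_per_gpu=64, num_gpus=8)

# Assumption: gradient accumulation is 1 in both runs.
if deepspeed_batch_size == baseline_batch_size:
    print("Effective batch sizes match.")
else:
    print("Effective batch sizes differ:", baseline_batch_size, deepspeed_batch_size)
```

If the two values differ in the real setup, the learning rate and other batch-size-dependent hyperparameters may no longer be comparable between the runs, so it is worth including these numbers in the reproduction script.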