The same hyperparameters with DeepSpeed are worse than without DeepSpeed

I’m training a model (DPCNN). Without DeepSpeed I use a batch size of 512; with DeepSpeed, the actual per-GPU batch size is 512/8 = 64 (I have 8 GPUs). However, the loss and accuracy with DeepSpeed are far worse than without it. The two experiments use the same code and the same hyperparameters, except for the batch size. Does anyone have an idea that could explain this?
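For context on how the batch sizes relate: DeepSpeed ties the global batch size to the per-GPU micro-batch size via `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. A minimal config sketch for the setup described above (assuming 8 GPUs and no gradient accumulation; the values here are illustrative, not taken from my actual run):

```python
# DeepSpeed enforces:
#   train_batch_size == train_micro_batch_size_per_gpu
#                       * gradient_accumulation_steps
#                       * world_size (number of GPUs)
ds_config = {
    "train_batch_size": 512,               # global batch size, same as the non-DeepSpeed run
    "train_micro_batch_size_per_gpu": 64,  # 512 / 8 GPUs
    "gradient_accumulation_steps": 1,
}
```

If these three values are inconsistent with the launch setup, the effective global batch size (and therefore the loss curve) will differ between the two runs.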


Hello,

We need a small script that we can run to reproduce the behaviour. In our experiments, DeepSpeed works as expected. For reference, please go through this blog post: Accelerate Large Model Training using DeepSpeed (huggingface.co), which covers the GLUE task and shows on-par results with and without DeepSpeed.
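For anyone putting such a script together, a minimal repro might look like the sketch below. The model, data, and optimizer settings are placeholders, not the original poster's actual setup; launch with `deepspeed --num_gpus 8 repro.py` so the config's batch-size math holds:

```python
import torch
import deepspeed

# Placeholder model and synthetic data; substitute the real DPCNN and dataset.
model = torch.nn.Linear(128, 2)
data = [(torch.randn(64, 128), torch.randint(0, 2, (64,))) for _ in range(100)]

ds_config = {
    "train_batch_size": 512,               # global batch across all GPUs
    "train_micro_batch_size_per_gpu": 64,  # 512 / 8 GPUs
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize wraps the model for distributed training and
# builds the optimizer from the config.
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for x, y in data:
    x, y = x.to(engine.device), y.to(engine.device)
    loss = torch.nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)  # DeepSpeed handles gradient averaging across ranks
    engine.step()
```

A script of roughly this shape, with the real model and data swapped in, is usually enough to reproduce and compare the two runs.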

I have the same question. In my code, the accuracy is the same as without DeepSpeed, but the loss is very high.
Have you solved this problem? Could you give me some suggestions? Thank you.