Getting NaN with DeepSpeed

Hi

I am adding adapter layers [1] between the layers of an MT5 model and using DeepSpeed to run training, but the training loss is always NaN. @stas, I would greatly appreciate any advice on how to make training with DeepSpeed more stable and resolve the NaN problem. Are there any parameters or settings I can adjust in DeepSpeed for better stability?
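
To make the question concrete, here is a minimal sketch of the kind of settings I mean. It assumes mixed-precision (fp16) training with dynamic loss scaling plus gradient clipping. The values are illustrative placeholders, not the configuration from my failing run, and `is_finite_loss` is a hypothetical helper for catching the first bad step.

```python
import json
import math

# Illustrative DeepSpeed-style config fragment (assumed values, not my real run).
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 requests dynamic loss scaling
        "initial_scale_power": 16,  # starting loss scale is 2**16
        "loss_scale_window": 1000,  # steps between attempts to raise the scale
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "gradient_clipping": 1.0,       # clip the gradient norm to this value
}


def is_finite_loss(loss_value: float) -> bool:
    """Hypothetical helper: True if the loss is neither NaN nor +/-inf."""
    return math.isfinite(loss_value)


if __name__ == "__main__":
    # Write the fragment out as JSON so it can be passed as a config file.
    print(json.dumps(ds_config, indent=2))
    print(is_finite_loss(float("nan")))
```

Checking the loss with something like `is_finite_loss(loss.item())` at every step would at least show whether the NaN appears on the very first step or only after some training.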

Thank you very much.

[1] https://arxiv.org/pdf/1902.00751.pdf