Hi
I am adding adapter layers [1] between the layers of the mT5 model and running training with DeepSpeed, but the training loss is always NaN. @stas, I would greatly appreciate any advice on how to make training with DeepSpeed more stable and resolve the NaN problem. Are there any parameters or settings I can adjust in DeepSpeed for better stability?
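To be concrete about the kind of settings I mean, here is a minimal sketch of the precision-related section of a DeepSpeed config, built as a Python dict. The helper name `make_ds_config` and every value are placeholders for illustration, not my actual setup, and I am only assuming that mixed precision is related to the NaNs. The keys shown (`fp16`, `bf16`, `gradient_clipping`, and the loss-scaling options) are, as far as I understand, the ones DeepSpeed's JSON config exposes.

```python
import json


def make_ds_config(use_bf16: bool) -> dict:
    """Hypothetical helper: build a DeepSpeed-style config dict.

    All values are illustrative placeholders, not tuned recommendations.
    """
    config = {
        "train_micro_batch_size_per_gpu": 8,  # placeholder batch size
        "gradient_clipping": 1.0,             # clip the gradient norm
    }
    if use_bf16:
        # bf16 needs hardware support; it uses no loss scaling.
        config["bf16"] = {"enabled": True}
    else:
        config["fp16"] = {
            "enabled": True,
            "loss_scale": 0,            # 0 selects dynamic loss scaling
            "initial_scale_power": 16,  # initial scale is 2**16
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        }
    return config


if __name__ == "__main__":
    # Print the fp16 variant as JSON, the format DeepSpeed reads.
    print(json.dumps(make_ds_config(use_bf16=False), indent=2))
```

Only one of `fp16` or `bf16` is enabled at a time here, since the two modes are alternatives.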
Thank you very much.