MobileBERT encoder returns NaNs when using FP16 (AMP)

When fine-tuning MobileBERT for token classification with FP16 (AMP) enabled, the encoder inside MobileBERT's forward call returns NaN values, which in turn makes the loss NaN. This does not occur in FP32 mode. It happens as early as the first mini-batch (observed through a debugger).
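
For anyone who wants to narrow this down without stepping through a debugger, here is a rough sketch of how the first offending submodule could be located with PyTorch forward hooks. The helper name `find_nonfinite_modules` and the toy modules are my own illustration, not something from transformers or from the issue. The toy model only imitates an FP16 overflow on CPU by round-tripping a large value through `torch.float16`; it is not MobileBERT.

```python
import torch
from torch import nn


def find_nonfinite_modules(model, *args, **kwargs):
    """Run one forward pass and return the names of submodules whose
    output contains NaN or inf, in the order they were executed.
    (Hypothetical helper, for illustration only.)"""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Outputs may be a tensor, a tuple/list, or a dict-like object.
            if isinstance(output, dict):
                tensors = list(output.values())
            elif isinstance(output, (tuple, list)):
                tensors = list(output)
            else:
                tensors = [output]
            for t in tensors:
                if (torch.is_tensor(t) and t.is_floating_point()
                        and not torch.isfinite(t).all()):
                    bad.append(name)
                    break
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*args, **kwargs)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return bad


class Scale(nn.Module):
    """Toy module: multiplies its input by a constant."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor


class RoundTripHalf(nn.Module):
    """Toy module: casts to float16 and back, so values above the
    float16 maximum (65504) overflow to inf."""
    def forward(self, x):
        return x.to(torch.float16).to(torch.float32)


toy = nn.Sequential(Scale(1e5), RoundTripHalf(), nn.Identity())
x = torch.ones(2, 4)
print(find_nonfinite_modules(toy, x))  # ['1', '2']
```

The first name in the list is where the non-finite values first appear; later entries are usually just propagation. For the real case, the same call could presumably be made on the MobileBERT model with one batch inside a `torch.cuda.amp.autocast()` context, though I have not included that here.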

I was able to reproduce this with the run_ner example and reported it on GitHub: "MobileBERT FP16 returns nan loss", Issue #11327 in huggingface/transformers.