Finetuning MT0 produces 0 loss

I would like to finetune an MT0 model in fp16, int8, or int4. However, the loss is always 0 because of NaNs. How can I fix this 0 loss issue?

[INFO|trainer.py:327] 2023-08-26 21:05:39,168 >> {'loss': 3.3959, 'learning_rate': 9.993190040434134e-07, 'train_runtime': 14.3111, 'train_samples_per_second': 8.944, 'train_num_samples_consumed': 128, 'job_progress': 0.0006809959565865078, 'epoch': 0.0}
[INFO|trainer.py:327] 2023-08-26 21:05:52,255 >> {'loss': 0.0, 'learning_rate': 9.986380080868269e-07, 'train_runtime': 13.0882, 'train_samples_per_second': 9.78, 'train_num_samples_consumed': 256, 'job_progress': 0.0013619919131730156, 'epoch': 0.01}
[INFO|trainer.py:327] 2023-08-26 21:06:05,317 >> {'loss': 0.0, 'learning_rate': 9.979570121302404e-07, 'train_runtime': 13.0619, 'train_samples_per_second': 9.799, 'train_num_samples_consumed': 384, 'job_progress': 0.002042987869759523, 'epoch': 0.01}

Hi,
it looks like an overflow is happening because fp16's numerical range is too narrow for this model. Try switching the precision to bf16 instead.
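Here is a minimal sketch of what that switch looks like with the Hugging Face Trainer. The checkpoint name, toy dataset, and hyperparameters are placeholders, not your actual setup; the relevant part is `bf16=True` instead of `fp16=True` in the training arguments:

```python
# Sketch: finetune an MT0 checkpoint with bf16 mixed precision.
# "bigscience/mt0-large", the toy dataset, and the hyperparameters
# below are assumptions -- adapt them to your own script.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "bigscience/mt0-large"  # assumed checkpoint; use your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tiny toy dataset just so the script runs end to end.
raw = Dataset.from_dict(
    {"text": ["Translate to French: Hello, world."], "target": ["Bonjour, le monde."]}
)

def tokenize(batch):
    model_inputs = tokenizer(batch["text"], truncation=True)
    labels = tokenizer(text_target=batch["target"], truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(tokenize, batched=True, remove_columns=["text", "target"])

training_args = Seq2SeqTrainingArguments(
    output_dir="mt0-bf16-finetune",
    per_device_train_batch_size=8,
    learning_rate=1e-6,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,   # bfloat16 keeps fp32's exponent range, so activations don't overflow to inf/nan
    fp16=False,  # fp16's narrow range is what overflows with T5-family models
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

Note that `bf16=True` requires hardware with bfloat16 support (e.g. Ampere or newer GPUs).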

From the MT0 model card, I saw that it was finetuned in bf16:

  • Finetuning steps: 25000
  • Finetuning tokens: 4.62 billion
  • Precision: bfloat16
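Since the published checkpoint was trained in bfloat16, a small sketch like the one below (assuming the bigscience/mt0-large checkpoint; pick whichever MT0 size you use) loads the weights directly in that dtype, so the compute precision matches the original setup rather than fp16:

```python
# Sketch only: load MT0 in bfloat16 so the dtype matches the precision
# it was finetuned in. "bigscience/mt0-large" is an assumed checkpoint.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "bigscience/mt0-large",
    torch_dtype=torch.bfloat16,  # weights loaded as bf16 instead of fp32/fp16
)
print(next(model.parameters()).dtype)  # torch.bfloat16
```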