Notice that these values are significantly different (6.5 vs. 0.0084). If the last loss is the real training loss, then what losses were the logs outputting during training?
According to trainer.py, the "train_loss" in the final metrics is the average loss across all training steps, while the "loss" reported at each logging step is the average loss between the previous logging step and the current one.
The large gap between "train_loss" and "loss" is probably because you resumed training from a checkpoint: self._total_loss_scalar is reset to zero (it is not stored in the checkpoint), but self.state.global_step is restored correctly. Therefore, computing train_loss = self._total_loss_scalar / self.state.global_step after resuming divides the loss accumulated only since the resume by the total step count, giving a value (e.g. 0.0084) far below the real average loss.
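A minimal sketch of the arithmetic (not transformers code; the step counts and per-step loss of 6.5 are made-up numbers for illustration) showing why the reported train_loss collapses after a resume:

```python
def reported_train_loss(total_loss_scalar, global_step):
    # Mirrors train_loss = self._total_loss_scalar / self.state.global_step
    return total_loss_scalar / global_step

# Fresh run: loss of ~6.5 accumulated over all 1000 steps.
fresh = reported_train_loss(6.5 * 1000, 1000)

# Resumed at step 999: _total_loss_scalar restarts from 0, so only the
# final step's loss is accumulated, yet global_step still counts all
# 1000 steps -- the ratio is ~1000x too small.
resumed = reported_train_loss(6.5 * 1, 1000)

print(fresh, resumed)  # 6.5 vs. 0.0065
```

This matches the symptom in the question: the per-step "loss" logs look normal, but the final "train_loss" is orders of magnitude smaller than any loss seen during training.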