[trainer] 'train_loss' different from 'loss'

Hi all,

I am using the Trainer and training GPT2 from scratch. I have trained it for 50 epochs and during training I had logs like the one shown below:

{'loss': 6.513, 'learning_rate': 1.1749500646222535e-07, 'epoch': 49.99}

However, after the last epoch I get a log with some train metrics:


***** train metrics *****
  epoch                    =       50.0
  train_loss               =     0.0084
  train_runtime            = 0:12:31.27
  train_samples            =    8716143
  train_samples_per_second = 580087.323
  train_steps_per_second   =    566.434

Notice that these values are significantly different (6.5 vs. 0.0084). If the last value is the real training loss, then what losses were being logged during training?

Thanks

Hi! I just had this exact question while doing my own training. Did you ever find out the answer?

I have the same problem, did you find the answer?

For future readers arriving with the same question:
The final result is the average of all the losses.

According to trainer.py, the 'train_loss' in the final metrics is the average loss across all training steps, while the 'loss' at each logging step is the average loss over the steps since the previous logging step.
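To make that concrete, here is a small self-contained sketch (plain Python, not the actual trainer.py code; the variable names only mirror the Trainer attributes) of the bookkeeping, with made-up per-step losses:

losses = [6.9, 6.7, 6.6, 6.5, 6.5, 6.4]   # hypothetical per-step losses
logging_steps = 2

total_loss = 0.0    # plays the role of self._total_loss_scalar
window_loss = 0.0   # plays the role of the tr_loss accumulator
last_logged = 0     # plays the role of self._globalstep_last_logged

for global_step, step_loss in enumerate(losses, start=1):
    window_loss += step_loss
    if global_step % logging_steps == 0:
        # the 'loss' printed during training: average since the last log
        print({"loss": round(window_loss / (global_step - last_logged), 4)})
        total_loss += window_loss
        window_loss = 0.0
        last_logged = global_step

total_loss += window_loss
# the 'train_loss' in the final train metrics: average over every step
print({"train_loss": round(total_loss / len(losses), 4)})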
The significant difference between 'train_loss' and 'loss' is probably because you resumed training from a checkpoint: self._total_loss_scalar starts again from zero (it is not stored in the checkpoint), while self.state.global_step is restored correctly. Computing
train_loss = self._total_loss_scalar / self.state.global_step
after resuming from a checkpoint therefore gives a value (e.g. 0.0084) far smaller than the real average loss.
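Illustrative arithmetic (all numbers assumed, not taken from the run above): if the resumed segment covered only the last 500 of roughly 425,000 total steps, the division above understates the average loss by almost three orders of magnitude:

resumed_steps = 500            # steps actually executed after resuming (assumed)
avg_loss_in_segment = 6.5      # roughly the 'loss' seen in the training logs
total_steps = 425_000          # self.state.global_step for the whole run (assumed)

# only the resumed steps ever reach _total_loss_scalar
total_loss_scalar = avg_loss_in_segment * resumed_steps
train_loss = total_loss_scalar / total_steps
print(round(train_loss, 4))    # ~0.0076, far below the real average loss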
