Trainer's step loss always drops sharply after each epoch regardless of model / data

Hello,
This is a specific question on the behaviour of the Trainer class; I hope someone is able to help. I am trying to plot the training loss after each step, so I have subclassed Trainer and made a couple of edits at line 1772 (new lines are marked with # ADDED LINE), so that the if statement now reads:

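# Create the list once, before the loop over steps; re-creating it on every step would discard earlier losses.
step_losses = []  # ADDED LINE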
if (
    ((step + 1) % args.gradient_accumulation_steps != 0)
    and args.local_rank != -1
    and args._no_sync_in_gradient_accumulation
):
    # Avoid unnecessary DDP synchronization since there will be no backward pass on this example.
    with model.no_sync():
        tr_loss_step = self.training_step(model, inputs)
        step_losses.append(tr_loss_step.item())  # ADDED LINE
else:
    tr_loss_step = self.training_step(model, inputs)
    step_losses.append(tr_loss_step.item())  # ADDED LINE

and at the end I write step_losses to disk and plot them.
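For readers who would rather not edit the training loop itself, the same recording could probably be done by overriding `training_step` in a subclass. The following is only a sketch under assumptions: `StepLossRecorderMixin`, `save_step_losses` and `RecordingTrainer` are hypothetical names, not part of the library, and the arguments are passed through with `*args, **kwargs` because the exact `training_step` signature may differ between versions.

```python
import json


class StepLossRecorderMixin:
    """Hypothetical mixin: stores the value returned by every training_step call."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Created once per trainer, so it is never reset between steps.
        self.step_losses = []

    def training_step(self, *args, **kwargs):
        # Delegate to the real implementation, then record its return value.
        loss = super().training_step(*args, **kwargs)
        # .item() converts a scalar tensor to a plain Python float.
        self.step_losses.append(loss.item())
        return loss

    def save_step_losses(self, path):
        # Write the recorded losses to disk as a JSON list.
        with open(path, "w") as f:
            json.dump(self.step_losses, f)


# Assumed usage (not executed here):
#   class RecordingTrainer(StepLossRecorderMixin, Seq2SeqTrainer):
#       pass
```

As with the edit above, this records one value per call to `training_step`, so with gradient accumulation there is one entry per micro-batch rather than per optimizer update.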
I have noticed that, regardless of the dataset I use, if I calculate a moving average of the loss (for example over a 50-step window, to remove noise), the loss drops sharply at the start of each epoch before stabilising. This is a typical graph without smoothing:
[Figure: output2, per-step training loss without smoothing]

and this is with smoothing (moving average)
[Figure: output3, per-step training loss with moving-average smoothing]

On the horizontal axis is the number of steps (in this case 1932 steps per epoch, with 10 epochs showing). You can clearly see the drop at the start of each epoch.
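The smoothing and the per-epoch view described above can be sketched roughly as follows. This is an illustration, not the exact code used for the plots: `moving_average` and `split_by_epoch` are hypothetical helper names, and the recorded losses are assumed to be a flat list of floats with a fixed number of steps per epoch (1932 in the run above).

```python
import numpy as np


def moving_average(values, window=50):
    """Mean over a sliding window; returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only the positions where the window fits entirely.
    return np.convolve(values, kernel, mode="valid")


def split_by_epoch(values, steps_per_epoch):
    """Reshape a flat loss list into one row per complete epoch."""
    values = np.asarray(values, dtype=float)
    n_epochs = len(values) // steps_per_epoch
    # Drop any trailing steps that belong to an incomplete epoch.
    return values[: n_epochs * steps_per_epoch].reshape(n_epochs, steps_per_epoch)


# Tiny synthetic example: two "epochs" of four steps each.
demo = [3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0, 2.0]
print(moving_average(demo, window=2))        # [3.  3.  3.  2.5 2.  2.  2. ]
print(split_by_epoch(demo, 4).mean(axis=1))  # [3. 2.]
```

Note that a window straddling an epoch boundary averages losses from both epochs, so the smoothed curve shows the drop spread over roughly one window length rather than as a single step.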
My question is: is this behaviour normal? What is causing the moving average of the loss to drop so sharply at the start of each epoch? It happens with HuggingFace’s native Trainer class so I would exclude any major code bug. Any help is much appreciated.

Hi @AndreaSottana!

You may have already ruled this out, but is it possible that your model is memorizing its inputs, and the steep drop you see each epoch could be accounted for by the model re-seeing inputs it’s already been trained on?
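To make the memorization idea concrete, here is a deliberately crude toy simulation. It is only a sketch under a strong assumption, namely that a sample's loss depends solely on how many times that same sample has been trained on (real models also generalize across samples, so this is not a claim about what happens in the run above). The function name and the `decay` parameter are made up for illustration.

```python
import random


def simulate_losses(n_samples=100, n_epochs=3, decay=0.5, seed=0):
    """Toy model: loss of a sample = decay ** (times it was trained on before)."""
    rng = random.Random(seed)
    times_seen = [0] * n_samples
    step_losses = []
    for _ in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh random order every epoch
        for idx in order:
            # Loss is measured before the update on this sample.
            step_losses.append(decay ** times_seen[idx])
            times_seen[idx] += 1
    return step_losses


losses = simulate_losses()
epoch_means = [sum(losses[e * 100:(e + 1) * 100]) / 100 for e in range(3)]
print(epoch_means)  # [1.0, 0.5, 0.25]
```

Under this assumption the loss stays flat within an epoch and steps down exactly at each epoch boundary, whatever the shuffle order: during epoch k every sample has been seen exactly k times, wherever it falls in the order.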

Hi @alexedw
Thanks for your reply. I am using the standard HuggingFace Seq2SeqTrainer class which should have random data shuffling implemented by default, so the data won’t be presented in the same order. Therefore the model could see a certain training sample at the start of epoch 1 and then at the end of epoch 2, or at the end of epoch 1 and at the start of epoch 2, or anything in between. While it is true that after the first epoch all data have already been seen, this is just part of a normal multi-epoch training cycle, which shouldn’t in itself justify this specific loss behaviour I am encountering. However, if you think this behaviour is normal for a specific reason, then please let me know. I will leave this question open to see if anyone else has other ideas.

Thanks again,
Andrea