Loss behaviour for BERT fine-tuning on QNLI

Hello everyone,

I fine-tuned BERT on the QNLI dataset for 20 epochs, and here are the losses I got:

[Figure: loss curves over the 20 epochs of training]

We can see that the training loss increases within each epoch and then drops when the next epoch starts. I don’t really understand this behaviour. Does anyone have an idea where it might come from? Is it normal, or do you think it could come from a problem in my code?

Also, there are “spikes” at the end of each epoch (especially after 8 epochs). I think they come from the size of my batches: it is quite small, and my last batch contains only 2 samples.
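If the spikes really do come from a very small final batch, one thing to try is dropping that incomplete batch. The snippet below is only a sketch on a toy dataset of 10 samples with a batch size of 4 (both numbers are made up for illustration, not taken from the run above). It shows how `drop_last=True` in PyTorch's `DataLoader` removes the leftover batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples (illustrative size, not the real QNLI data).
dataset = TensorDataset(torch.arange(10))

# Default behaviour keeps the incomplete final batch.
keep_last = DataLoader(dataset, batch_size=4)
# drop_last=True discards it, so every batch has the same size.
drop_last = DataLoader(dataset, batch_size=4, drop_last=True)

sizes_keep = [len(batch) for (batch,) in keep_last]  # [4, 4, 2]
sizes_drop = [len(batch) for (batch,) in drop_last]  # [4, 4]
```

A loss averaged over 2 samples is much noisier than one averaged over a full batch, so removing the leftover batch is a cheap way to check whether it is responsible for the spikes.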

Here are the specifications of my training setup:

  • The model is the pre-trained “bert-base-cased” from the Transformers library, and I use the matching tokenizer.
  • I split the training set in an 80/20 ratio to get the validation set.
  • I optimized using Adam with a learning rate of 3e-5, nothing else.
  • I am evaluating on the validation set 10 times per epoch.

And here is the code I use on each batch for training:

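    # One training step on a single batch.
    optimizer.zero_grad()  # clear gradients left over from the previous step
    output = model(
        input_ids,
        attention_mask=attention_masks,
        token_type_ids=token_type_ids,
        labels=labels,
    )
    loss = output.loss  # the model computes the loss when labels are passed
    loss.backward()
    optimizer.step()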

Thank you in advance for your help !

Seems like you’re overfitting by training for so many epochs. The original BERT paper suggests fine-tuning for 2 to 4 epochs.

Some sanity checks:

  • Did you shuffle your training dataloader? (PyTorch’s DataLoader class doesn’t shuffle by default.)
  • I typically use a learning rate of 5e-5, might be worth trying out.
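To make the shuffling check concrete, here is a small sketch on a toy dataset of the integers 0 to 9 (the dataset and batch size are assumptions for illustration only). It compares a `DataLoader` with the default settings to one created with `shuffle=True`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset holding the integers 0..9 so the sample order is easy to read.
dataset = TensorDataset(torch.arange(10))

# shuffle defaults to False: samples come out in dataset order every epoch.
ordered_loader = DataLoader(dataset, batch_size=5)
# shuffle=True draws a new random permutation at the start of each epoch.
shuffled_loader = DataLoader(dataset, batch_size=5, shuffle=True)

ordered = [int(i) for (batch,) in ordered_loader for i in batch]
shuffled = [int(i) for (batch,) in shuffled_loader for i in batch]

print(ordered)   # always 0..9 in order
print(shuffled)  # same values in a random order
```

For the training set you would normally pass `shuffle=True`. The validation loader can stay unshuffled.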

Dear @lucasval and @nielsr,

Thank you for your answers. In fact, I already found what was wrong: I forgot to set the model back to training mode at the end of my evaluation loop!

As a result, the dropout layers were only active for the first tenth of each epoch and stayed in eval mode for the remaining nine tenths, which explains the periodic behavior of the loss.
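For anyone hitting the same issue, here is a minimal sketch of the fix. It uses a toy model with a single dropout layer in place of BERT, and `evaluate` is a hypothetical helper name, not the function from the actual training script. The point is that the evaluation routine restores whatever mode the model was in before it ran.

```python
import torch
from torch import nn

# Toy stand-in for the real model; only the Dropout layer matters here.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.1), nn.Linear(4, 2))

def evaluate(model, inputs):
    """Run a forward pass in eval mode, then restore the previous mode."""
    was_training = model.training
    model.eval()               # disables dropout for evaluation
    with torch.no_grad():      # no gradients needed while evaluating
        outputs = model(inputs)
    model.train(was_training)  # the step that is easy to forget
    return outputs

model.train()
outputs = evaluate(model, torch.randn(3, 4))
mode_after_eval = model.training  # True: dropout is active again
```

Without the `model.train(was_training)` line, every training step after the first evaluation would run with dropout disabled until something calls `model.train()` again.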

Again thank you for your replies!

Best regards,
Axel