Loss behaviour for bert fine-tuning on QNLI

axeldinh · May 7, 2021, 10:26am

Hello everyone,

I trained Bert on the QNLI dataset for 20 epochs and here are the losses I got:

[Image: plot of the losses]

We can see that the training loss increases within each epoch and then drops between epochs. I don’t really understand this behaviour. Does anyone have an idea where it might come from? Is it normal, or do you think it could come from a problem in my code?

There are also “spikes” at the end of each epoch (especially after 8 epochs). I think they come from my batch size: it is quite small, and the last batch of each epoch contains only 2 samples.
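One way to test this hypothesis is to drop the incomplete final batch. The snippet below is only a sketch on a toy dataset (10 samples with a batch size of 4, both made up for illustration); the actual dataset and batch size are not given in this thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the tokenized training set (assumed sizes):
# 10 samples with batch size 4 leave a final batch of 10 % 4 = 2 samples.
toy_dataset = TensorDataset(torch.arange(10))

keep_last = DataLoader(toy_dataset, batch_size=4)
drop_last = DataLoader(toy_dataset, batch_size=4, drop_last=True)

sizes_keep = [batch[0].shape[0] for batch in keep_last]
sizes_drop = [batch[0].shape[0] for batch in drop_last]
print(sizes_keep)  # [4, 4, 2]
print(sizes_drop)  # [4, 4]
```

If the spikes disappear with `drop_last=True`, the small final batch is a likely cause, since a loss averaged over 2 samples is much noisier than one averaged over a full batch.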

Here are the specifications of my training setup:

The model is “bert-base-cased”, loaded pre-trained from the Transformers library, along with its tokenizer.
I split the training set in an 80/20 ratio to get the validation set.
I optimized using Adam with a learning rate of 3e-5, nothing else.
I am evaluating on the validation set 10 times per epoch.

And here is the code I use on each batch for training:

    optimizer.zero_grad()
    output = model(
        input_ids,
        attention_mask=attention_masks,
        token_type_ids=token_type_ids,
        labels=labels,
    )
    loss = output.loss
    loss.backward()
    optimizer.step()

Thank you in advance for your help!

lucasval · October 15, 2021, 1:17pm

Seems like you’re overfitting by training for too many epochs. The original BERT paper suggests fine-tuning for 2 to 4 epochs.

nielsr · October 15, 2021, 1:30pm

Some sanity checks:

Did you shuffle your training dataloader? (PyTorch’s DataLoader class doesn’t shuffle by default.)
I typically use a learning rate of 5e-5, might be worth trying out.
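For the first check, here is a minimal sketch of the difference, using a toy dataset of 8 integers as a stand-in for the real training set (the dataset and batch size are assumptions for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the training set: the samples are just the integers 0..7.
toy_dataset = TensorDataset(torch.arange(8))

# shuffle defaults to False, so samples come back in dataset order.
ordered = DataLoader(toy_dataset, batch_size=4)
# shuffle=True draws a new random permutation at every epoch.
shuffled = DataLoader(toy_dataset, batch_size=4, shuffle=True)

ordered_ids = torch.cat([batch[0] for batch in ordered]).tolist()
shuffled_ids = torch.cat([batch[0] for batch in shuffled]).tolist()
print(ordered_ids)   # always in dataset order
print(shuffled_ids)  # same samples, in a random order
```

Shuffling is usually only wanted for the training dataloader; the validation dataloader can stay in dataset order.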

axeldinh · October 15, 2021, 2:24pm

Dear @lucasval and @nielsr,

Thank you for your answers. In fact, I have already found what was wrong: I forgot to set the model back to training mode at the end of my evaluation loop!

Because of this, the dropout layers were only active for the first tenth of each epoch (until the first evaluation) and stayed in eval mode for the remaining nine tenths, which explains the periodic behaviour of the loss.
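A minimal sketch of the fix, assuming an evaluation helper like the hypothetical `evaluate` below. The small `nn.Sequential` model is only a stand-in for the BERT classifier, and the validation batch is random data:

```python
import torch
from torch import nn

# Hypothetical stand-in for the BERT classifier: any module containing dropout.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.1), nn.Linear(4, 2))

def evaluate(model, batches):
    """Return the mean validation loss and restore the previous mode afterwards."""
    was_training = model.training
    model.eval()  # disable dropout while evaluating
    loss_fn = nn.CrossEntropyLoss()
    total = 0.0
    with torch.no_grad():
        for inputs, labels in batches:
            total += loss_fn(model(inputs), labels).item()
    if was_training:
        model.train()  # the call that was missing: re-enable dropout
    return total / len(batches)

# Random validation batch, for illustration only.
val_batches = [(torch.randn(3, 4), torch.tensor([0, 1, 0]))]

model.train()
val_loss = evaluate(model, val_batches)
print(model.training)  # True
```

Restoring the mode inside the helper means the training loop cannot forget it, however many times per epoch the evaluation runs.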

Again thank you for your replies!

Best regards,
Axel
