Different intermediate results given different numbers of epochs

We are using the Hugging Face API to fine-tune a pretrained model (BertForSequenceClassification).
We see differences in the first five epochs between the 5-epoch and 15-epoch runs and do not understand why they are not (nearly) identical, given that only the number of epochs differs between those runs (the seed and all other parameters are the same).
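
One thing we did notice: the learning_rate column already differs at the first logged step. We assume this is because the Trainer's default linear decay schedule is computed over the total number of training steps, which changes with num_train_epochs. A quick sketch (our own helper, assuming the default 5e-5 initial learning rate, zero warmup steps, and the ~11,250 optimizer steps per epoch implied by step 9000 at epoch 0.8) reproduces the logged values:

def linear_lr(step, total_steps, initial_lr=5e-5):
    # Trainer's default schedule with no warmup: linear decay to zero
    # over the total number of training steps.
    return initial_lr * (1 - step / total_steps)

steps_per_epoch = 11_250
for num_epochs in (5, 15):
    print(num_epochs, linear_lr(500, num_epochs * steps_per_epoch))
# 5  -> 4.9555...e-05, matching the 5-epoch logs at step 500
# 15 -> 4.9851...e-05, matching the 15-epoch logs at step 500

So the two runs apply a different learning rate at every optimizer step. What we do not understand is whether that alone should produce intermediate losses this different.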

For example:

Seed 7

5 epochs:
index,loss,learning_rate,epoch,step
0,24.6558,4.955555555555556e-05,0.04,500
1,19.9439,4.9111111111111114e-05,0.09,1000
2,19.2654,4.866666666666667e-05,0.13,1500
3,20.4078,4.8222222222222225e-05,0.18,2000
4,20.3372,4.7777777777777784e-05,0.22,2500
5,20.0602,4.7333333333333336e-05,0.27,3000
6,19.6761,4.6888888888888895e-05,0.31,3500
7,20.193,4.644444444444445e-05,0.36,4000
8,19.1265,4.600000000000001e-05,0.4,4500
9,19.1949,4.555555555555556e-05,0.44,5000
10,19.5078,4.511111111111112e-05,0.49,5500
11,20.7165,4.466666666666667e-05,0.53,6000
12,20.1907,4.422222222222222e-05,0.58,6500
13,19.6967,4.377777777777778e-05,0.62,7000
14,19.6693,4.3333333333333334e-05,0.67,7500
15,20.011,4.2888888888888886e-05,0.71,8000
16,19.516,4.2444444444444445e-05,0.76,8500
17,18.9949,4.2e-05,0.8,9000

15 epochs:
index,loss,learning_rate,epoch,step
0,18.9326,4.9851851851851855e-05,0.04,500
1,5.6773,4.970370370370371e-05,0.09,1000
2,4.6515,4.955555555555556e-05,0.13,1500
3,4.2881,4.940740740740741e-05,0.18,2000
4,3.641,4.925925925925926e-05,0.22,2500
5,3.2491,4.9111111111111114e-05,0.27,3000
6,3.012,4.896296296296297e-05,0.31,3500
7,2.8161,4.881481481481482e-05,0.36,4000
8,2.7497,4.866666666666667e-05,0.4,4500
9,2.6776,4.851851851851852e-05,0.44,5000
10,2.5254,4.837037037037037e-05,0.49,5500
11,2.6059,4.8222222222222225e-05,0.53,6000
12,2.5966,4.807407407407408e-05,0.58,6500
13,2.2252,4.792592592592593e-05,0.62,7000
14,2.3321,4.7777777777777784e-05,0.67,7500
15,2.23,4.762962962962963e-05,0.71,8000
16,2.3754,4.7481481481481483e-05,0.76,8500

Seed 0:

5 epochs:
index,loss,learning_rate,epoch,step
0,17.7629,4.955555555555556e-05,0.04,500
1,5.6264,4.9111111111111114e-05,0.09,1000
2,4.9429,4.866666666666667e-05,0.13,1500
3,4.5756,4.8222222222222225e-05,0.18,2000
4,4.4063,4.7777777777777784e-05,0.22,2500
5,3.9688,4.7333333333333336e-05,0.27,3000
6,3.6656,4.6888888888888895e-05,0.31,3500
7,3.6779,4.644444444444445e-05,0.36,4000
8,3.2495,4.600000000000001e-05,0.4,4500
9,3.2306,4.555555555555556e-05,0.44,5000
10,3.1333,4.511111111111112e-05,0.49,5500
11,2.7543,4.466666666666667e-05,0.53,6000
12,3.1086,4.422222222222222e-05,0.58,6500
13,3.0666,4.377777777777778e-05,0.62,7000
14,3.156,4.3333333333333334e-05,0.67,7500
15,2.5553,4.2888888888888886e-05,0.71,8000
16,2.7727,4.2444444444444445e-05,0.76,8500
17,2.651,4.2e-05,0.8,9000

15 epochs:
index,loss,learning_rate,epoch,step
0,14.8927,4.9851851851851855e-05,0.04,500
1,5.4558,4.970370370370371e-05,0.09,1000
2,4.065,4.955555555555556e-05,0.13,1500
3,3.8751,4.940740740740741e-05,0.18,2000
4,3.4581,4.925925925925926e-05,0.22,2500
5,3.1641,4.9111111111111114e-05,0.27,3000
6,2.8896,4.896296296296297e-05,0.31,3500
7,2.8967,4.881481481481482e-05,0.36,4000
8,2.5912,4.866666666666667e-05,0.4,4500
9,2.5563,4.851851851851852e-05,0.44,5000
10,2.482,4.837037037037037e-05,0.49,5500
11,2.1695,4.8222222222222225e-05,0.53,6000
12,2.447,4.807407407407408e-05,0.58,6500
13,2.4438,4.792592592592593e-05,0.62,7000
14,2.2014,4.7777777777777784e-05,0.67,7500
15,2.2,4.762962962962963e-05,0.71,8000

The only difference between the experiments is the number of epochs.
We also saved the train and validation splits to disk and read them back from there, to make sure every run processes the examples in the same order.
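
Roughly like this (a sketch; tokenized_datasets stands for our tokenized datasets.DatasetDict, built in code not shown):

# Persist the tokenized splits once, so every run loads identical
# examples in an identical order via load_from_disk below.
tokenized_datasets["train"].save_to_disk(tokenized_datasets_path + "/train")
tokenized_datasets["validation"].save_to_disk(tokenized_datasets_path + "/validation")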

My environment:

  • transformers version: 4.31.0
  • Platform: Linux-4.18.0-477.15.1.el8_8.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.6
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0

Here is part of my code:

import os
import random

import datasets
import numpy as np
import torch
import torch.nn as nn
from transformers import (AutoTokenizer, DataCollatorWithPadding, TrainingArguments,
                          BertForSequenceClassification, Trainer, AutoConfig)

# cseed, checkpoint, max_token_len, out_path, epochs_num, and
# tokenized_datasets_path are set elsewhere (not shown).
random.seed(cseed)
np.random.seed(cseed)
torch.manual_seed(cseed)
torch.cuda.manual_seed_all(cseed)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"  # for deterministic cuBLAS

tokenizer = AutoTokenizer.from_pretrained(checkpoint, model_max_length=max_token_len)
training_args = TrainingArguments(
    out_path,
    save_total_limit=10,
    # load_best_model_at_end=True,
    report_to=None,
    evaluation_strategy="steps",
    eval_steps=11250,
    do_eval=True,
    num_train_epochs=epochs_num,
    seed=cseed,
)

from transformers import set_seed
set_seed(cseed)

train_data_from_disk = datasets.Dataset.load_from_disk(tokenized_datasets_path + "/train", keep_in_memory=True)
validation_data_from_disk = datasets.Dataset.load_from_disk(tokenized_datasets_path + "/validation", keep_in_memory=True)

model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
loss_fn = nn.MSELoss()  # used by our CustomTrainer subclass (definition not shown)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = CustomTrainer(
    model,
    training_args,
    train_dataset=train_data_from_disk,
    eval_dataset=validation_data_from_disk,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
training_results = trainer.train()
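
Is there anything else we should be setting for bit-for-bit reproducibility? For reference, a minimal sketch of the extra switches we are aware of from PyTorch's reproducibility notes (we have not verified that any of these are required here):

# Additional determinism switches from the PyTorch reproducibility notes
# (a sketch; we have not confirmed these change anything in our setup):
torch.use_deterministic_algorithms(True)   # raise on nondeterministic ops
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning

transformers also bundles these settings together with the seeding via TrainingArguments(..., full_determinism=True), as far as we understand.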
