Hi folks,
I’ve noticed that fixing the seed in the Trainer does not produce the same results across multiple training runs. For example, suppose we fix the seed in the TrainingArguments and instantiate a model and trainer as follows:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

batch_size = 16
model_checkpoint = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

args = TrainingArguments(
    "test-glue",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    seed=123,
)

# train_ds, valid_ds and compute_metrics are defined earlier in the notebook
trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
If I fine-tune this on CoLA, I get the following results:
Epoch | Training Loss | Validation Loss | Matthews Correlation | Runtime | Samples Per Second
1 | 0.517900 | 0.467562 | 0.455635 | 0.740500 | 1408.496000
2 | 0.335300 | 0.500026 | 0.490934 | 0.686200 | 1519.946000
3 | 0.232300 | 0.618692 | 0.493626 | 0.693100 | 1504.833000
Now suppose we re-instantiate the model and trainer, and fine-tune:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
I would expect to get exactly the same results, but instead find small differences in the outputs:
Epoch | Training Loss | Validation Loss | Matthews Correlation | Runtime | Samples Per Second
1 | 0.519000 | 0.473168 | 0.458883 | 0.702900 | 1483.911000
2 | 0.335800 | 0.502644 | 0.486363 | 0.701400 | 1487.070000
3 | 0.229200 | 0.616023 | 0.497130 | 0.710800 | 1467.258000
My best guess is that the DataLoader is the source of the difference, since it uses a RandomSampler. However, I would have thought that fixing the seed would also fix that.
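For what it’s worth, here is a rough sketch of the kind of check I have in mind, reusing model, args, tokenizer, train_ds and valid_ds from above: compare the first few batches produced by the training dataloader across two freshly seeded Trainer instances. I’m assuming the collated batches expose a "labels" field, and this only probes the sampling order, not every other source of randomness.

from transformers import set_seed

def first_batch_labels(trainer, n=3):
    # Pull the first few batches from the training dataloader so the
    # shuffling order can be compared across runs.
    batches = []
    for i, batch in enumerate(trainer.get_train_dataloader()):
        if i >= n:
            break
        batches.append(batch["labels"].tolist())
    return batches

set_seed(123)
trainer_a = Trainer(model, args, train_dataset=train_ds, eval_dataset=valid_ds, tokenizer=tokenizer)
run_a = first_batch_labels(trainer_a)

set_seed(123)
trainer_b = Trainer(model, args, train_dataset=train_ds, eval_dataset=valid_ds, tokenizer=tokenizer)
run_b = first_batch_labels(trainer_b)

print(run_a == run_b)  # True would suggest the sampler order is not the culprit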
Although these differences are tiny, having reproducible runs really helps me stay sane during debugging, so I’m wondering whether anyone here knows how to fix this?
For context, I’m working in Jupyter notebooks, and here is a small Colab notebook that produces the numbers above: https://colab.research.google.com/drive/15nv40o81JfKwubFBVOjZkqk8bRauXig8?usp=sharing
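One thing I still plan to try, though it’s just a guess on my part: from_pretrained warns that the classification head is newly initialised, so I wonder whether I also need to re-seed explicitly right before each model instantiation, on top of the seed in TrainingArguments. A minimal sketch of what I mean:

from transformers import set_seed

# Guess: re-seed immediately before *each* model instantiation so that the
# randomly initialised classification head starts from the same weights in
# every run, in addition to the seed the Trainer already uses.
set_seed(123)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()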