Fixing the random seed in the Trainer does not produce the same results across runs

Hi folks,

I’ve noticed that fixing the seed in the Trainer does not produce the same results across multiple training runs.

For example, suppose we fix the seed in the TrainingArguments and instantiate a model and trainer as follows:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

batch_size = 16
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

args = TrainingArguments(
    "test-glue",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    seed=123
)

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

If I fine-tune this on CoLA, I get the following results:

Epoch	Training Loss	Validation Loss	Matthews Correlation	Runtime	Samples Per Second
1	0.517900	0.467562	0.455635	0.740500	1408.496000
2	0.335300	0.500026	0.490934	0.686200	1519.946000
3	0.232300	0.618692	0.493626	0.693100	1504.833000

Now suppose we re-instantiate the model and trainer, and fine-tune:

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

I would expect to get exactly the same results, but instead find small differences in the outputs:

Epoch	Training Loss	Validation Loss	Matthews Correlation	Runtime	Samples Per Second
1	0.519000	0.473168	0.458883	0.702900	1483.911000
2	0.335800	0.502644	0.486363	0.701400	1487.070000
3	0.229200	0.616023	0.497130	0.710800	1467.258000

My best guess is that the DataLoader is the source of the difference since it uses a RandomSampler. However, I would have thought that fixing the seed would also fix that.

Although these differences are tiny, having reproducible runs really helps me stay sane during debugging. Does anyone here know how to fix this?

For context, I’m working in Jupyter notebooks and here is a small Colab notebook from which I produced the above numbers: https://colab.research.google.com/drive/15nv40o81JfKwubFBVOjZkqk8bRauXig8?usp=sharing

For full reproducibility you need to instantiate your model inside the Trainer by using the model_init argument (or setting a seed before instantiating your model). You have random weights in your model head and those are different in your two runs in the code you show.
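To illustrate the point above, here is a minimal sketch that uses only Python's `random` module as a stand-in for the real model and Trainer. The names `make_head`, `train` and `SEED` are hypothetical, and the assumption that the trainer applies the seed itself before drawing random numbers is a simplification for illustration, not a description of the library's internals.

```python
import random

SEED = 123  # same value as the seed in the question


def make_head(n=4):
    """Toy stand-in for a randomly initialised classification head."""
    return [random.gauss(0.0, 0.02) for _ in range(n)]


def train(head):
    """Toy stand-in for a trainer: applies the seed, then draws random numbers."""
    random.seed(SEED)
    order = list(range(8))
    random.shuffle(order)  # e.g. the shuffle done by a random sampler
    return head, order


def model_init():
    """Seed first, then build the head, so every call yields the same weights."""
    random.seed(SEED)
    return make_head()


# Pattern from the question: the head is built before the seed is applied.
head_a, order_a = train(make_head())
head_b, order_b = train(make_head())

# Pattern from the answer: the head is built right after seeding.
head_c, order_c = train(model_init())
head_d, order_d = train(model_init())

print("data order identical:", order_a == order_b)
print("heads identical without seeding first:", head_a == head_b)
print("heads identical with model_init:", head_c == head_d)
```

In this toy setup the shuffled data order is the same in every run, so the sampler is not the culprit; only the initial weights differ when the head is created before seeding. With the real library, the corresponding change would be to pass `model_init` to the `Trainer`, or to call `transformers.set_seed(123)` right before `from_pretrained`.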

Thanks for the clarification @sgugger! I opened a small PR to make this more explicit in the docs: "Clarify definition of seed argument in TrainingArguments" (huggingface/transformers, Pull Request #9903, by lewtun).

Hi, I have tried doing that and it still doesn’t work:

# Define model
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=3,
        output_attentions=False,  # Whether the model returns attention weights.
        output_hidden_states=False,
        return_dict=True,
    )

trainer = Trainer(
    model_init=model_init,
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset,     # evaluation dataset
    compute_metrics=compute_metrics,
    # callbacks=[EarlyStoppingCallback(3, 0.0)]  # early stopping if results don't improve after 3 epochs
)