Fixing the random seed in the Trainer does not produce the same results across runs

lewtun · January 29, 2021, 8:56pm

Hi folks,

I’ve noticed that fixing the seed in the Trainer does not produce the same results across multiple training runs.

For example, suppose we fix the seed in the TrainingArguments and instantiate a model and trainer as follows:

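from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# train_ds, valid_ds, tokenizer and compute_metrics are defined in the linked notebook
batch_size = 16
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

args = TrainingArguments(
    "test-glue",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    seed=123,
)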

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

If I fine-tune this on CoLA, I get the following results:

Epoch	Training Loss	Validation Loss	Matthews Correlation	Runtime	Samples Per Second
1	0.517900	0.467562	0.455635	0.740500	1408.496000
2	0.335300	0.500026	0.490934	0.686200	1519.946000
3	0.232300	0.618692	0.493626	0.693100	1504.833000

Now suppose we re-instantiate the model and trainer, and fine-tune:

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

I would expect to get exactly the same results, but instead find small differences in the outputs:

Epoch	Training Loss	Validation Loss	Matthews Correlation	Runtime	Samples Per Second
1	0.519000	0.473168	0.458883	0.702900	1483.911000
2	0.335800	0.502644	0.486363	0.701400	1487.070000
3	0.229200	0.616023	0.497130	0.710800	1467.258000

My best guess is that the DataLoader is the source of the difference since it uses a RandomSampler. However, I would have thought that fixing the seed would also fix that.

Although these differences are tiny, having reproducible runs really helps me stay sane while debugging, so I’m wondering whether anyone here knows how to fix this.

For context, I’m working in Jupyter notebooks and here is a small Colab notebook from which I produced the above numbers: https://colab.research.google.com/drive/15nv40o81JfKwubFBVOjZkqk8bRauXig8?usp=sharing

sgugger · January 30, 2021, 12:09am

For full reproducibility you need to instantiate your model inside the Trainer by using the model_init argument (or setting a seed before instantiating your model). You have random weights in your model head and those are different in your two runs in the code you show.
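
A minimal sketch of the effect described above, using plain PyTorch rather than the Trainer. The `make_head` helper and the 768-to-2 linear layer are illustrative stand-ins for the randomly initialised classification head, not the actual Transformers code:

```python
import torch
from torch import nn


def make_head(seed=None):
    """Build a stand-in for the randomly initialised classification head."""
    if seed is not None:
        # Reset the global RNG right before the weights are created.
        torch.manual_seed(seed)
    return nn.Linear(768, 2)


# Without re-seeding, each instantiation draws fresh random weights.
unseeded_a, unseeded_b = make_head(), make_head()
heads_differ = not torch.equal(unseeded_a.weight, unseeded_b.weight)

# Re-seeding immediately before each instantiation gives identical weights.
seeded_a, seeded_b = make_head(seed=123), make_head(seed=123)
heads_match = torch.equal(seeded_a.weight, seeded_b.weight)

print(heads_differ, heads_match)
```

This is the same situation as calling from_pretrained twice in one notebook session: the second head is drawn from a random number generator whose state has already moved on, unless the seed is set again first.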

lewtun · January 30, 2021, 11:06am

Thanks for the clarification @sgugger! I opened a small PR to make this more explicit in the docs: “Clarify definition of seed argument in TrainingArguments” (Pull Request #9903, huggingface/transformers on GitHub).

theudster · June 16, 2021, 9:22pm

Hi, I have tried doing it this way and it still doesn’t work:

# Define model
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=3,
        output_attentions=False,  # whether the model returns attention weights
        output_hidden_states=False,
        return_dict=True,
    )

trainer = Trainer(
    model_init=model_init,
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,             # evaluation dataset
    compute_metrics=compute_metrics,
    #callbacks=[EarlyStoppingCallback(3, 0.0)] # early stopping if results don't improve after 3 epochs
)

ialuronico · February 2, 2022, 5:43am

Actually, using your exact code fixed the issue for me.

tnt306 · March 27, 2025, 4:10pm

As of March 2025, the model_init argument is no longer available in SFTTrainer.

I investigated the code and found the commit in which it was deleted completely.

So now I don’t know how to let the Trainer initialize the model and thereby use the seed to fix the randomness.
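
One option already mentioned earlier in this thread is to set a seed before instantiating the model yourself, and then pass the model to the trainer. The sketch below is only an illustration under assumptions: `build_model` is a hypothetical helper, and the tiny `DistilBertConfig` sizes are made up so that every weight is randomly initialised locally without downloading a checkpoint. In practice the body of `build_model` would call from_pretrained instead.

```python
import torch
from transformers import (
    DistilBertConfig,
    DistilBertForSequenceClassification,
    set_seed,
)

SEED = 123

# Tiny illustrative config (hypothetical sizes), so nothing is downloaded.
config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2, num_labels=2
)


def build_model(seed):
    """Seed Python, NumPy and PyTorch, then create the model."""
    set_seed(seed)
    return DistilBertForSequenceClassification(config)


model_a = build_model(SEED)
model_b = build_model(SEED)

# Every parameter, including the classification head, should now match.
same_init = all(
    torch.equal(p, q)
    for p, q in zip(model_a.parameters(), model_b.parameters())
)
print(same_init)
```

The resulting model can then be passed to the trainer through its model argument, with the same seed also set in the training arguments. Whether this makes a full SFTTrainer run reproducible has not been verified here, since other sources of nondeterminism (for example GPU kernels) may remain.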
