Multiple training will give exactly the same result except for the first time

Hi, I have a function that will load a pre-trained model and fine-tune it for sentiment analysis then calculates the F1 score and returns the result.
The problem is when I call this function multiple times with the exact same arguments, it will give the exact same metric score which is expected, except for the first time which is different, how is that possible?

This is my function which is written based on this tutorial in hugging face:

import uuid

import numpy as np

from datasets import (
    load_dataset,
    load_metric,
    DatasetDict,
    concatenate_datasets
)

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

CHECKPOINT = "distilbert-base-uncased"
SAVING_FOLDER = "sst2"
def custom_train(datasets, checkpoint=CHECKPOINT, saving_folder=SAVING_FOLDER):

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
    def tokenize_function(example):
        return tokenizer(example["sentence"], truncation=True)

    tokenized_datasets = datasets.map(tokenize_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    saving_folder = f"{SAVING_FOLDER}_{str(uuid.uuid1())}"
    training_args = TrainingArguments(saving_folder)

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    
    trainer.train()
    
    predictions = trainer.predict(tokenized_datasets["test"])
    print(predictions.predictions.shape, predictions.label_ids.shape)
    preds = np.argmax(predictions.predictions, axis=-1)
    
    metric_fun = load_metric("f1")
    metric_result = metric_fun.compute(predictions=preds, references=predictions.label_ids)
    
    return metric_result

And then I will run this function several times with the same datasets, and append the result of the returned F1 score each time:

raw_datasets = load_dataset("glue", "sst2")

small_datasets = DatasetDict({
    "train": raw_datasets["train"].select(range(100)).flatten_indices(),
    "validation": raw_datasets["validation"].select(range(100)).flatten_indices(),
    "test": raw_datasets["validation"].select(range(100, 200)).flatten_indices(),
})

results = []
for i in range(4):
    result = custom_train(small_datasets)
    results.append(result)

And then when I check the results list:

[{'f1': 0.7755102040816325}, {'f1': 0.5797101449275361}, {'f1': 0.5797101449275361}, {'f1': 0.5797101449275361}]

Something that may come to mind is that when I load a pre-trained model, the head will be initialized with random weights and that is why the results are different, if that is the case why only the first one is different and the others are exactly the same?

You need to set the seed before instantiating your model, otherwise the random head is not initialized the same way, that’s why the first run will always be different.
The subsequent runs are all the same because the seed has been set by the Trainer in the train method.

To set the seed:

from transformers import set_seed

set_seed(42)
1 Like