Repeated training runs give exactly the same result, except for the first one

SMMousavi · July 19, 2021, 1:17pm

Hi, I have a function that loads a pre-trained model, fine-tunes it for sentiment analysis, computes the F1 score, and returns the result.
When I call this function multiple times with exactly the same arguments, every call returns the same score, as expected, except the first call, which returns a different one. How is that possible?

This is my function, which is based on this Hugging Face tutorial:

import uuid

import numpy as np

from datasets import (
    load_dataset,
    load_metric,
    DatasetDict,
    concatenate_datasets
)

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

CHECKPOINT = "distilbert-base-uncased"
SAVING_FOLDER = "sst2"
def custom_train(datasets, checkpoint=CHECKPOINT, saving_folder=SAVING_FOLDER):

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
    def tokenize_function(example):
        return tokenizer(example["sentence"], truncation=True)

    tokenized_datasets = datasets.map(tokenize_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Use the saving_folder argument (not the global constant) and make it unique per run
    output_dir = f"{saving_folder}_{uuid.uuid1()}"
    training_args = TrainingArguments(output_dir)

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    
    trainer.train()
    
    predictions = trainer.predict(tokenized_datasets["test"])
    print(predictions.predictions.shape, predictions.label_ids.shape)
    preds = np.argmax(predictions.predictions, axis=-1)
    
    metric_fun = load_metric("f1")
    metric_result = metric_fun.compute(predictions=preds, references=predictions.label_ids)
    
    return metric_result

I then run this function several times on the same datasets and append the returned F1 score each time:

raw_datasets = load_dataset("glue", "sst2")

small_datasets = DatasetDict({
    "train": raw_datasets["train"].select(range(100)).flatten_indices(),
    "validation": raw_datasets["validation"].select(range(100)).flatten_indices(),
    "test": raw_datasets["validation"].select(range(100, 200)).flatten_indices(),
})

results = []
for i in range(4):
    result = custom_train(small_datasets)
    results.append(result)

When I then check the results list, I get:

[{'f1': 0.7755102040816325}, {'f1': 0.5797101449275361}, {'f1': 0.5797101449275361}, {'f1': 0.5797101449275361}]

One thing that comes to mind is that when I load a pre-trained model, the classification head is initialized with random weights, which would explain why the results differ. But if that is the case, why is only the first result different while the others are exactly the same?

sgugger · July 19, 2021, 5:55pm

You need to set the seed before instantiating your model; otherwise the random head is not initialized the same way each time, which is why the first run will always be different.
The subsequent runs are all the same because by then the seed has been set by the Trainer in the train method, so each later model is instantiated from the same random state.
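The pattern in the question (one odd result, then identical ones) can be reproduced with a toy simulation. This is only a sketch using Python's standard `random` module: `toy_run` is a hypothetical stand-in for `custom_train`, the first `random.random()` call stands in for the random head initialization, and `random.seed` stands in for the seeding done by the Trainer. It is not the actual Trainer code.

```python
import random


def toy_run(seed=42, steps=5):
    # Stand-in for from_pretrained(): the new head is drawn from whatever
    # state the global RNG happens to be in right now.
    head = random.random()
    # Stand-in for the Trainer: it seeds the global RNG, then training
    # consumes a fixed number of random draws.
    random.seed(seed)
    noise = sum(random.random() for _ in range(steps))
    return head + noise


# The first run draws its "head" from an unseeded RNG. Every later run draws
# it from the state left behind by the previous (seeded) run, which is always
# the same state.
results = [toy_run() for _ in range(4)]
print(results)
```

Under these assumptions, the last three values are identical, while the first almost certainly differs, which mirrors the F1 scores above.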

To set the seed:

from transformers import set_seed

set_seed(42)
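For the loop in the question, one way to apply this is to reseed before every call, so that each run instantiates its model from the same random state. The helper below is a minimal sketch, not part of `transformers`: `run_seeded` and its parameters are hypothetical names, and the seeding function is passed in so the demo runs with the standard library alone.

```python
import random


def run_seeded(train_fn, seed_fn, n_runs=4, seed=42):
    """Call seed_fn(seed) before each run so every run starts from the same RNG state."""
    results = []
    for _ in range(n_runs):
        # Seed first: the model is instantiated inside train_fn, so the seed
        # has to be set before train_fn is called.
        seed_fn(seed)
        results.append(train_fn())
    return results


# Stand-in demo: the "training" is a single random draw.
demo = run_seeded(lambda: random.random(), random.seed)
print(demo)
```

With the code from the question, the call would look like `run_seeded(lambda: custom_train(small_datasets), set_seed)`, with `set_seed` imported from `transformers`. That should make the first F1 score match the later ones, although some GPU operations can still introduce small non-deterministic differences.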
