Hi, I have a function that will load a pre-trained model and fine-tune it for sentiment analysis then calculates the F1 score and returns the result.
The problem is when I call this function multiple times with the exact same arguments, it will give the exact same metric score which is expected, except for the first time which is different, how is that possible?
This is my function which is written based on this tutorial in hugging face:
import uuid
import numpy as np
from datasets import (
load_dataset,
load_metric,
DatasetDict,
concatenate_datasets
)
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
DataCollatorWithPadding,
TrainingArguments,
Trainer,
)
CHECKPOINT = "distilbert-base-uncased"
SAVING_FOLDER = "sst2"
def custom_train(datasets, checkpoint=CHECKPOINT, saving_folder=SAVING_FOLDER):
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
return tokenizer(example["sentence"], truncation=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
saving_folder = f"{SAVING_FOLDER}_{str(uuid.uuid1())}"
training_args = TrainingArguments(saving_folder)
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
trainer.train()
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)
preds = np.argmax(predictions.predictions, axis=-1)
metric_fun = load_metric("f1")
metric_result = metric_fun.compute(predictions=preds, references=predictions.label_ids)
return metric_result
And then I will run this function several times with the same datasets, and append the result of the returned F1 score each time:
raw_datasets = load_dataset("glue", "sst2")
small_datasets = DatasetDict({
"train": raw_datasets["train"].select(range(100)).flatten_indices(),
"validation": raw_datasets["validation"].select(range(100)).flatten_indices(),
"test": raw_datasets["validation"].select(range(100, 200)).flatten_indices(),
})
results = []
for i in range(4):
result = custom_train(small_datasets)
results.append(result)
And then when I check the results list:
[{'f1': 0.7755102040816325}, {'f1': 0.5797101449275361}, {'f1': 0.5797101449275361}, {'f1': 0.5797101449275361}]
Something that may come to mind is that when I load a pre-trained model, the head will be initialized with random weights and that is why the results are different, if that is the case why only the first one is different and the others are exactly the same?