Repeating eval-F1 scores with seed + data randomization

I was fine-tuning RoBERTa on a multi-label classification problem 10 times and keeping track of each evaluation F1 score. However, even when randomizing the selection of training and testing data, as well as the seeds (i.e. transformers.set_seed, numpy, tf, tf.keras, random), I find that towards the end the evaluation F1 scores repeat. An example run-through is below.

Any advice? The number of sentences used to train is 250, and the number used to evaluate is 250. The number used to test at the end is 200. All of these are randomly selected from a source of 5000+ sentences.

  1. {'eval_loss': 0.5038333535194397, 'eval_f1': 0.8596491228070176, 'eval_recall': 0.8966666666666667, 'eval_accuracy': 0.885, 'eval_runtime': 0.1429, 'eval_samples_per_second': 1399.396, 'eval_steps_per_second': 13.994}

  2. {'eval_loss': 0.4131013751029968, 'eval_f1': 0.8973174175330509, 'eval_recall': 0.9133333333333333, 'eval_accuracy': 0.92, 'eval_runtime': 0.1412, 'eval_samples_per_second': 1416.786, 'eval_steps_per_second': 14.168}

  3. {'eval_loss': 0.4108392000198364, 'eval_f1': 0.90541946467417, 'eval_recall': 0.9299999999999999, 'eval_accuracy': 0.925, 'eval_runtime': 0.1377, 'eval_samples_per_second': 1452.709, 'eval_steps_per_second': 14.527}

  4. {'eval_loss': 0.7171087861061096, 'eval_f1': 0.8828281582436557, 'eval_recall': 0.9166666666666666, 'eval_accuracy': 0.905, 'eval_runtime': 0.1368, 'eval_samples_per_second': 1462.467, 'eval_steps_per_second': 14.625}

  5. {'eval_loss': 1.1088610887527466, 'eval_f1': 0.8746081504702194, 'eval_recall': 0.9, 'eval_accuracy': 0.9, 'eval_runtime': 0.1388, 'eval_samples_per_second': 1440.98, 'eval_steps_per_second': 14.41}

  6. {'eval_loss': 1.3679903745651245, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1375, 'eval_samples_per_second': 1454.361, 'eval_steps_per_second': 14.544}

  7. {'eval_loss': 1.5757907629013062, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1393, 'eval_samples_per_second': 1436.076, 'eval_steps_per_second': 14.361}

  8. {'eval_loss': 1.7049527168273926, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1389, 'eval_samples_per_second': 1439.714, 'eval_steps_per_second': 14.397}

  9. {'eval_loss': 1.7694827318191528, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1383, 'eval_samples_per_second': 1446.456, 'eval_steps_per_second': 14.465}

  10. {'eval_loss': 1.80936861038208, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1048, 'eval_samples_per_second': 1909.088, 'eval_steps_per_second': 19.091}

If it is any help, here is my custom train function (the relevant part at least):

and here is a place where I do the seed randomizations:
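A minimal sketch of the kind of seeding described above (illustrative only, assuming the standard transformers.set_seed, numpy, tensorflow and random APIs, not the exact snippet from the run):

import random
import numpy as np
import tensorflow as tf
from transformers import set_seed

def reseed_everything(seed=None):
    # draw a fresh seed when none is given, then propagate it to every RNG
    if seed is None:
        seed = random.randint(0, 2**32 - 1)
    set_seed(seed)            # transformers: seeds random, numpy and the available DL backend
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)  # tf.keras layers draw from TensorFlow's global RNG
    return seed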


I wonder if the model output has reached the ideal value…

I don’t think so, given that in the third fine-tuning run it reached an eval_f1 of 0.9054, higher than 0.88154.


Hmm… For now, here is an answer by ChatGPT:


Summary

Your repeated F1 scores happen because, with only 250 evaluation samples, the macro-F1 metric can take on only a limited set of discrete values (ā€œquantizationā€), so if successive runs end up with similar splits or converge to similar predictions, you’ll see exactly the same F1 repeated. To fix this:

  1. Explicitly randomize splits each run by supplying a fresh seed to both .shuffle() and .train_test_split() in the Hugging Face Datasets API.
  2. Use k-fold (or repeated) cross-validation to average F1 over multiple splits, smoothing out quantization.
  3. Increase your evaluation set size or report complementary metrics (e.g., micro-F1, precision, recall) for finer granularity.

1. Truly Random Splits

By default, dataset.train_test_split() shuffles using an internal RNG that may not change across runs unless you pass seed yourself. Likewise, dataset.shuffle() needs its own seed. Example:

import random

# draw a fresh seed for this run and use it for both shuffling and splitting
seed = random.randint(0, 2**32 - 1)
ds = ds.shuffle(seed=seed)
train_test = ds.train_test_split(test_size=250, shuffle=True, seed=seed)
train_ds, eval_ds = train_test["train"], train_test["test"]
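Logging the freshly drawn seed alongside each run’s scores keeps individual runs reproducible even though the seed itself changes every time.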

2. k-Fold Cross-Validation

Single splits on small data lead to coarse F1 jumps (~1/250 ā‰ˆ 0.004). Instead, use scikit-learn’s CV to average across folds:

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)  # new folds on every run
scores = cross_val_score(
    estimator=my_trainer,  # a wrapper around your Hugging Face Trainer exposing fit()/predict()
    X=X_data,              # raw sentences (or pre-tokenized features)
    y=y_labels,            # the corresponding labels
    cv=cv,
    scoring="f1_macro",
)
print(f"Macro-F1: {scores.mean():.4f} ± {scores.std():.4f}")

This yields a distribution of F1 scores and a more reliable mean ± std.
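Since a Hugging Face Trainer is not a scikit-learn estimator out of the box, a manual loop over the fold indices achieves the same averaging. Below is a minimal sketch under that assumption; train_and_eval is a hypothetical helper that fine-tunes a fresh model on the training indices and returns the macro-F1 on the held-out fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_macro_f1(labels, train_and_eval, n_splits=5, seed=None):
    # labels: 1-D array of class labels, used here only for stratification
    # train_and_eval(train_idx, eval_idx): hypothetical callable that fine-tunes a
    # fresh model on train_idx and returns the macro-F1 measured on eval_idx
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [
        train_and_eval(train_idx, eval_idx)
        for train_idx, eval_idx in cv.split(np.zeros((len(labels), 1)), labels)
    ]
    return float(np.mean(scores)), float(np.std(scores))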


3. Increase Eval Size & Add Metrics

  • A larger hold-out set (e.g., 1,000 samples) makes the F1 steps smaller (1/1000 = 0.001), reducing repetition.
  • Complementary metrics such as micro-F1, precision, recall, or AUC often vary more smoothly and reveal differences that macro-F1 alone can miss (see the sketch after this list).
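As a sketch of that last point, assuming a multi-label setup where the model emits one logit per label (a sigmoid/BCE-style head), a compute_metrics function passed to the Trainer can report several averages at once; the thresholding and metric names here are illustrative:

from sklearn.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # the Trainer passes (logits, label_ids) as numpy arrays
    logits, labels = eval_pred
    labels = labels.astype(int)
    preds = (logits > 0).astype(int)  # logit > 0 corresponds to sigmoid probability > 0.5
    return {
        "f1_macro": f1_score(labels, preds, average="macro", zero_division=0),
        "f1_micro": f1_score(labels, preds, average="micro", zero_division=0),
        "precision_micro": precision_score(labels, preds, average="micro", zero_division=0),
        "recall_micro": recall_score(labels, preds, average="micro", zero_division=0),
    }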

By ensuring truly random splits, averaging over multiple folds, and broadening your evaluation metrics or set size, you’ll eliminate the ā€œstuckā€ F1 behavior and see meaningful run-to-run variability.