Repeating eval-F1 scores with seed + data randomization

I was fine-tuning RoBERTa on a multi-label classification problem 10 times and keeping track of each evaluation F1 score. However, even when randomizing the selection of training and testing data, as well as the seeds (i.e. transformers.set_seed, numpy, tf, tf.keras, random), I find that towards the end the evaluation F1 scores repeat. An example run-through is below.

Any advice? The number of sentences used to train is 250, and the number used to evaluate is 250. The number used to test at the end is 200. All of these are randomly selected from a source of 5000+ sentences.

  1. {'eval_loss': 0.5038333535194397, 'eval_f1': 0.8596491228070176, 'eval_recall': 0.8966666666666667, 'eval_accuracy': 0.885, 'eval_runtime': 0.1429, 'eval_samples_per_second': 1399.396, 'eval_steps_per_second': 13.994}

  2. {'eval_loss': 0.4131013751029968, 'eval_f1': 0.8973174175330509, 'eval_recall': 0.9133333333333333, 'eval_accuracy': 0.92, 'eval_runtime': 0.1412, 'eval_samples_per_second': 1416.786, 'eval_steps_per_second': 14.168}

  3. {'eval_loss': 0.4108392000198364, 'eval_f1': 0.90541946467417, 'eval_recall': 0.9299999999999999, 'eval_accuracy': 0.925, 'eval_runtime': 0.1377, 'eval_samples_per_second': 1452.709, 'eval_steps_per_second': 14.527}

  4. {'eval_loss': 0.7171087861061096, 'eval_f1': 0.8828281582436557, 'eval_recall': 0.9166666666666666, 'eval_accuracy': 0.905, 'eval_runtime': 0.1368, 'eval_samples_per_second': 1462.467, 'eval_steps_per_second': 14.625}

  5. {'eval_loss': 1.1088610887527466, 'eval_f1': 0.8746081504702194, 'eval_recall': 0.9, 'eval_accuracy': 0.9, 'eval_runtime': 0.1388, 'eval_samples_per_second': 1440.98, 'eval_steps_per_second': 14.41}

  6. {'eval_loss': 1.3679903745651245, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1375, 'eval_samples_per_second': 1454.361, 'eval_steps_per_second': 14.544}

  7. {'eval_loss': 1.5757907629013062, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1393, 'eval_samples_per_second': 1436.076, 'eval_steps_per_second': 14.361}

  8. {'eval_loss': 1.7049527168273926, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1389, 'eval_samples_per_second': 1439.714, 'eval_steps_per_second': 14.397}

  9. {'eval_loss': 1.7694827318191528, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1383, 'eval_samples_per_second': 1446.456, 'eval_steps_per_second': 14.465}

  10. {'eval_loss': 1.80936861038208, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1048, 'eval_samples_per_second': 1909.088, 'eval_steps_per_second': 19.091}

If it is any help, here is my custom train function (the relevant part at least):

and here is a place where I do the seed randomizations:
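A minimal sketch of the kind of seeding described above (illustrative only, assuming the standard transformers.set_seed, numpy, tensorflow and random APIs, not the exact snippet from the run):

import random
import numpy as np
import tensorflow as tf
from transformers import set_seed

def reseed_everything(seed=None):
    # draw a fresh seed when none is given, then propagate it to every RNG
    if seed is None:
        seed = random.randint(0, 2**32 - 1)
    set_seed(seed)            # transformers: seeds random, numpy and the available DL backend
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)  # tf.keras layers draw from TensorFlow's global RNG
    return seed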


I wonder if the model output has reached the ideal value…

I don’t think so, given that in the third fine-tuning run it reached an eval_f1 of 0.9054, higher than 0.88154.


Hmm… For now, here is an answer by ChatGPT:


Summary

Your repeated F1 scores happen because, with only 250 evaluation samples, the macro-F1 metric can take on only a limited set of discrete values (ā€œquantizationā€), so if successive runs end up with similar splits or converge to similar predictions, you’ll see exactly the same F1 repeated. To fix this:

  1. Explicitly randomize splits each run by supplying a fresh seed to both .shuffle() and .train_test_split() in the Hugging Face Datasets API.
  2. Use k-fold (or repeated) cross-validation to average F1 over multiple splits, smoothing out quantization.
  3. Increase your evaluation set size or report complementary metrics (e.g., micro-F1, precision, recall) for finer granularity.

1. Truly Random Splits

By default, dataset.train_test_split() shuffles using an internal RNG that may not change across runs unless you pass seed yourself. Likewise, dataset.shuffle() needs its own seed. Example:

import random

# draw a fresh seed for this run and use it for both shuffling and splitting
seed = random.randint(0, 2**32 - 1)
ds = ds.shuffle(seed=seed)
train_test = ds.train_test_split(test_size=250, shuffle=True, seed=seed)
train_ds, eval_ds = train_test["train"], train_test["test"]
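Logging the freshly drawn seed alongside each run’s scores keeps individual runs reproducible even though the seed itself changes every time.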

2. k-Fold Cross-Validation

Single splits on small data lead to coarse F1 jumps (~1/250 ā‰ˆ 0.004). Instead, use scikit-learn’s CV to average across folds:

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)  # new folds on every run
scores = cross_val_score(
    estimator=my_trainer,  # a wrapper around your Hugging Face Trainer exposing fit()/predict()
    X=X_data,              # raw sentences (or pre-tokenized features)
    y=y_labels,            # the corresponding labels
    cv=cv,
    scoring="f1_macro",
)
print(f"Macro-F1: {scores.mean():.4f} ± {scores.std():.4f}")

This yields a distribution of F1 scores and a more reliable mean ± std.
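Since a Hugging Face Trainer is not a scikit-learn estimator out of the box, a manual loop over the fold indices achieves the same averaging. Below is a minimal sketch under that assumption; train_and_eval is a hypothetical helper that fine-tunes a fresh model on the training indices and returns the macro-F1 on the held-out fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_macro_f1(labels, train_and_eval, n_splits=5, seed=None):
    # labels: 1-D array of class labels, used here only for stratification
    # train_and_eval(train_idx, eval_idx): hypothetical callable that fine-tunes a
    # fresh model on train_idx and returns the macro-F1 measured on eval_idx
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [
        train_and_eval(train_idx, eval_idx)
        for train_idx, eval_idx in cv.split(np.zeros((len(labels), 1)), labels)
    ]
    return float(np.mean(scores)), float(np.std(scores))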


3. Increase Eval Size & Add Metrics

  • A larger hold-out set (e.g., 1,000 samples) makes the F1 steps smaller (1/1000 = 0.001), reducing repetition.
  • Complementary metrics such as micro-F1, precision, recall, or AUC often vary more smoothly and reveal differences that macro-F1 alone can miss (see the sketch after this list).
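As a sketch of that last point, assuming a multi-label setup where the model emits one logit per label (a sigmoid/BCE-style head), a compute_metrics function passed to the Trainer can report several averages at once; the thresholding and metric names here are illustrative:

from sklearn.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # the Trainer passes (logits, label_ids) as numpy arrays
    logits, labels = eval_pred
    labels = labels.astype(int)
    preds = (logits > 0).astype(int)  # logit > 0 corresponds to sigmoid probability > 0.5
    return {
        "f1_macro": f1_score(labels, preds, average="macro", zero_division=0),
        "f1_micro": f1_score(labels, preds, average="micro", zero_division=0),
        "precision_micro": precision_score(labels, preds, average="micro", zero_division=0),
        "recall_micro": recall_score(labels, preds, average="micro", zero_division=0),
    }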

By ensuring truly random splits, averaging over multiple folds, and broadening your evaluation metrics or set size, you’ll eliminate the ā€œstuckā€ F1 behavior and see meaningful run-to-run variability.