I am fine-tuning RoBERTa on a multi-label classification problem, repeating the run 10 times and tracking the evaluation F1 score each time. However, even though I randomize the selection of training and testing data as well as the seeds (transformers.set_seed, numpy, tf, tf.keras, random), the evaluation F1 scores repeat towards the end of every run: the eval loss keeps climbing while F1, recall, and accuracy freeze at identical values over the last several epochs. An example run is shown below, and my seeding setup is sketched right after this paragraph.
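For reference, the per-run seeding looks roughly like this (a minimal sketch; the seed is drawn fresh for every run, the draw range is illustrative, and `tf.keras.utils.set_random_seed` assumes a recent TF version):

```python
import random

import numpy as np
import tensorflow as tf
from transformers import set_seed

# Draw a fresh seed for this run (range is illustrative)
seed = random.randint(0, 2**32 - 1)

set_seed(seed)                        # transformers: seeds random, numpy, and the framework backend
random.seed(seed)                     # Python's random module
np.random.seed(seed)                  # NumPy
tf.random.set_seed(seed)              # TensorFlow global seed
tf.keras.utils.set_random_seed(seed)  # tf.keras: re-seeds random/numpy/tf in one call
```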
Any advice? I train on 250 sentences, evaluate on 250, and test at the end on a further 200, all randomly selected from a pool of 5000+ sentences, roughly as in the sketch below.
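The sampling itself is along these lines (a sketch with stand-in data; `sentences` and `labels` are placeholders for my actual corpus, and the dummy multi-label targets are only there to make it runnable):

```python
import random

# Illustrative stand-ins for my actual data (5000+ labeled sentences)
sentences = [f"sentence {i}" for i in range(5000)]
labels = [[random.randint(0, 1) for _ in range(3)] for _ in range(5000)]  # multi-label targets

# Shuffle once, then carve out disjoint splits
pairs = list(zip(sentences, labels))
random.shuffle(pairs)

train = pairs[:250]        # 250 sentences to train on
eval_set = pairs[250:500]  # 250 to evaluate each epoch
test = pairs[500:700]      # 200 held out for the final test
```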
- {'eval_loss': 0.5038333535194397, 'eval_f1': 0.8596491228070176, 'eval_recall': 0.8966666666666667, 'eval_accuracy': 0.885, 'eval_runtime': 0.1429, 'eval_samples_per_second': 1399.396, 'eval_steps_per_second': 13.994}
- {'eval_loss': 0.4131013751029968, 'eval_f1': 0.8973174175330509, 'eval_recall': 0.9133333333333333, 'eval_accuracy': 0.92, 'eval_runtime': 0.1412, 'eval_samples_per_second': 1416.786, 'eval_steps_per_second': 14.168}
- {'eval_loss': 0.4108392000198364, 'eval_f1': 0.90541946467417, 'eval_recall': 0.9299999999999999, 'eval_accuracy': 0.925, 'eval_runtime': 0.1377, 'eval_samples_per_second': 1452.709, 'eval_steps_per_second': 14.527}
- {'eval_loss': 0.7171087861061096, 'eval_f1': 0.8828281582436557, 'eval_recall': 0.9166666666666666, 'eval_accuracy': 0.905, 'eval_runtime': 0.1368, 'eval_samples_per_second': 1462.467, 'eval_steps_per_second': 14.625}
- {'eval_loss': 1.1088610887527466, 'eval_f1': 0.8746081504702194, 'eval_recall': 0.9, 'eval_accuracy': 0.9, 'eval_runtime': 0.1388, 'eval_samples_per_second': 1440.98, 'eval_steps_per_second': 14.41}
- {'eval_loss': 1.3679903745651245, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1375, 'eval_samples_per_second': 1454.361, 'eval_steps_per_second': 14.544}
- {'eval_loss': 1.5757907629013062, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1393, 'eval_samples_per_second': 1436.076, 'eval_steps_per_second': 14.361}
- {'eval_loss': 1.7049527168273926, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1389, 'eval_samples_per_second': 1439.714, 'eval_steps_per_second': 14.397}
- {'eval_loss': 1.7694827318191528, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1383, 'eval_samples_per_second': 1446.456, 'eval_steps_per_second': 14.465}
- {'eval_loss': 1.80936861038208, 'eval_f1': 0.8815424420960754, 'eval_recall': 0.91, 'eval_accuracy': 0.905, 'eval_runtime': 0.1048, 'eval_samples_per_second': 1909.088, 'eval_steps_per_second': 19.091}