How can I make fine-tuning runs reliable?

I am using AutoTokenizer.from_pretrained('bert-base-cased') to develop a candidate classifier for a content-based recommendation system (smart RSS reader). I trained it three times using code very similar to the starter example here
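For context, the setup is roughly this shape (a sketch, not my exact notebook code; the training loop itself is elided):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# roughly the starter-example setup: bert-base-cased with a binary
# classification head (interesting vs. not interesting)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2
)
# ...tokenize the judgement data and hand both to a Trainer, as in the example
```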

and you can find three Jupyter notebooks for the training here

I use ROC AUC as my metric because I think it's appropriate for my case, where I mostly take the top-scoring results. I ran the test data against two industrialized models: a bag-of-words/logistic-regression model, and another that uses the all-MiniLM-L6-v2 embedding from
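Concretely, the metric I report is computed like this (a sketch; I'm assuming the usual `(logits, labels)` pair that the HF Trainer hands to `compute_metrics`):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def compute_metrics(eval_pred):
    """ROC AUC from the (logits, labels) pair the HF Trainer passes in."""
    logits, labels = eval_pred
    # softmax over the two classes; the positive-class probability is the score
    exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
    scores = (exps / exps.sum(axis=-1, keepdims=True))[:, 1]
    return {"roc_auc": roc_auc_score(labels, scores)}
```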

and puts it through a probability-calibrated SVM. The bag-of-words model gets an AUC of 0.66 on this data set and the embedding model gets about 0.68. This data set contains 10 days' worth of judgements; if I train on 40 days, the AUC values are more like 0.72 and 0.76. I think the bag-of-words model is saturated there, but I think the embedding model would improve if I added more data.

With 10 days' worth of data I get highly unreliable results from my training procedure: in three tries I got 0.72 (great!), 0.66 (same as bag-of-words), and 0.54 (fail!). In all of those cases, I think the training loss history reflects the results.

I am familiar with training networks with "early stopping", but most of my experience was with autoencoders, LSTMs, and CNNs, just before transformers became cool. I'm a little baffled by what I've seen so far with larger training sets (40 or 90 days' worth of data): often I see very little change in the training and evaluation loss after the first 800 samples or so, and no improvement over multiple epochs, so I don't see how to get started with early stopping (although all three of these runs show loss/samples curves that look promising).
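For what it's worth, the early-stopping wiring I'd expect to use looks like this (a sketch; the exact argument names vary a little across transformers versions, and `"roc_auc"` assumes a `compute_metrics` that reports that key):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    evaluation_strategy="steps",      # newer versions call this eval_strategy
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",  # assumes compute_metrics reports this key
    greater_is_better=True,
)
# trainer = Trainer(model=..., args=args, ...,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```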

So: (1) I would like to industrialize a fine-tuned model for this function; (2) I do care about the AUC metric; (3) I can't build this into a script if I don't get reliable results; and (4) I can't run experiments to improve quality if the results are this random (see Deming).

So, what can I do to get more reliable results from training?

… I improved matters quite a bit by setting learning_rate=2e-5 in the TrainingArguments. With this setting it is now typical to get a 0.72 AUC on the 10-day dataset with the transformer classifier, although it still struggles to beat the bag-of-words model with 40 days of data.
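To quantify the run-to-run variance, I've started repeating the same configuration under different seeds and looking at the spread of AUCs (a sketch; the model construction and train/evaluate calls are elided, and the seed values are arbitrary):

```python
from transformers import TrainingArguments, set_seed

aucs = []
for seed in (13, 42, 99):
    set_seed(seed)  # seeds python, numpy, and torch in one call
    args = TrainingArguments(
        output_dir=f"run-seed-{seed}",
        learning_rate=2e-5,
        seed=seed,
    )
    # ...build a fresh model + Trainer with these args, train, and
    # append the eval ROC AUC to aucs...
# report the mean and spread of aucs across seeds
```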