AWD-LSTM beats fine-tuned BERT as training dataset size decreases?! 🤷🏽

  • I’m building a radiology report classifier. In this particular example, the input is the radiology report text and the target is whether the patient has micro-calcifications (positive, negative, n/a).
  • I’m comparing two models: 1) an AWD-LSTM and 2) a BERT model that was pre-trained on radiology reports.
  • The dataset is stratified by target label and split 50/50 into train/validation sets (n = 450/450).
  • When I train on the full split with train_ds = 450, I get 90%+ accuracy with the pre-trained BERT model vs. ~80% accuracy for the AWD-LSTM.
  • However, as I scale back the training dataset size with `df_train = df_train.sample(frac=frac, random_state=42)`, the BERT model performs worse than the AWD-LSTM, which is not what I expected (setup sketched below).
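For reference, here’s a minimal sketch of the setup as described, assuming a dataframe `df` with hypothetical `report`/`target` columns and scikit-learn for the stratified split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stratified 50/50 split by label (n = 450/450)
df_train, df_valid = train_test_split(
    df, test_size=0.5, stratify=df["target"], random_state=42
)

# Naive subsampling to shrink the training set, e.g. frac = 0.5, 0.25, ...
frac = 0.5
df_train = df_train.sample(frac=frac, random_state=42)
```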

Have you seen similar results in your training?

Found the culprit!
It turned out to be a fluke caused by how I was creating the progressively smaller subsets. I made the mistake of doing a stratified 50/50 split and then randomly sampling a percentage of that split for each new subset. Randomly sampling after the split renders the stratification moot, since the smaller subsets’ label distributions can drift away from the original proportions (fix sketched below).
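A minimal sketch of the fix, assuming pandas >= 1.1 and a hypothetical label column named `target`: sampling within each label group keeps the class proportions intact at every `frac`.

```python
import pandas as pd

def stratified_subset(df: pd.DataFrame, frac: float,
                      label_col: str = "target", seed: int = 42) -> pd.DataFrame:
    """Sample `frac` of each label group so class proportions are preserved."""
    return (
        df.groupby(label_col, group_keys=False)
          .sample(frac=frac, random_state=seed)
    )

# e.g. df_train = stratified_subset(df_train, frac=0.25)
```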

One thing that still confuses me is the following:

  • Why is an AWD-LSTM-based text classifier getting comparable results to a domain-specific BERT model?
    • I skipped the LM fine-tuning step for the AWD-LSTM (a sketch of that step follows this list)
    • The BERT model was fine-tuned on Radiology reports
    • My classification dataset size is fairly small
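
For completeness, here’s a minimal sketch of the skipped LM fine-tuning step in the ULMFiT style, assuming fastai v2 (where AWD-LSTM usually comes from) and the same hypothetical `report`/`target` column names; the epoch counts are placeholders, not tuned values:

```python
from fastai.text.all import *

# 1) Fine-tune the pretrained AWD-LSTM language model on the raw report text
dls_lm = TextDataLoaders.from_df(df, text_col="report", is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(5)
learn_lm.save_encoder("radiology_encoder")

# 2) Build the classifier on top of the fine-tuned encoder,
#    reusing the LM vocab so the embeddings line up
dls_clf = TextDataLoaders.from_df(
    df, text_col="report", label_col="target", text_vocab=dls_lm.vocab
)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM)
learn_clf.load_encoder("radiology_encoder")
learn_clf.fine_tune(5)
```

In ULMFiT terms, that LM step is what adapts the encoder to radiology vocabulary before classification, which is typically where the small-dataset gains come from, so skipping it makes the comparable results more surprising, not less.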