Hi, I am trying to fine-tune bert-base-uncased and bert-large-uncased on a binary text classification task. I loaded both models using the transformers from_pretrained() function.
The bert-base-uncased model reaches roughly 0.9 AUC, while the bert-large-uncased model only reaches around 0.5 AUC, which is no better than random guessing. I'm really confused by these results, since I use the same TrainingArguments for both models. What could explain the terrible performance of the bert-large model? Its validation loss starts to increase after the first epoch. Is it because bert-large overfits quickly on my sample? (But overfitting alone shouldn't push the AUC all the way down to 0.5, right?)
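To make the 0.5 figure concrete, here is a minimal sketch in plain Python of how AUC behaves on a few toy inputs. The roc_auc helper, the labels, and the scores are all hypothetical and made up for illustration; none of them come from the actual training run. It only shows that a model emitting the same score for every example gets exactly 0.5, which may be one thing worth checking in the bert-large predictions.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC: the fraction of positive/negative
    pairs where the positive example gets the higher score.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [0, 0, 1, 1, 0, 1]

# Same score for every example: every pair is a tie.
constant_scores = [0.5] * len(labels)

# Every positive is scored above every negative.
separating_scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]

# Every positive is scored below every negative.
inverted_scores = [1.0 - s for s in separating_scores]

print("constant  :", roc_auc(labels, constant_scores))
print("separating:", roc_auc(labels, separating_scores))
print("inverted  :", roc_auc(labels, inverted_scores))
```

If the bert-large predictions on the validation set turn out to be (almost) identical for every example, that would point to the model not learning at all under these settings rather than to overfitting.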
Thank you.