I am dealing with a binary classification task (non-English language) of relatively long documents (~4k words on average). I have tested a Logistic Regression trained on simplistic BoW features, yielding reasonable performance.
I am now testing the multilingual BERT, with two linear layers on top of it and using the Cross-Entropy loss; however, its performance is quite low. The “annoying” part is that on a given test set, BERT always predicts the majority class. It is worth saying that the dataset (both train and test) is rather imbalanced (80/20).
I have tried the following without any luck:
a) Play around with the learning rate, class weighting, num of linear layers & associated configurations.
b) Select different parts of the document as input to BERT.
c) Generate balanced samples (incl. oversampling the minority class).
I have also tried generating a synthetic toy dataset of 1K examples from one document belonging to one class and another 1K examples from one document belonging belonging to the other class - the performance was perfect, as expected.
Is there something obvious that I am missing in terms of debugging my model? Is the problem the imbalanced nature of the dataset I am working with? Could a Focal loss (or anything else) help on this end?