Dealing with Imbalanced Datasets?

Hi everyone,

I am dealing with a binary classification task (non-English language) on relatively long documents (~4k words on average). I have tested a logistic regression model trained on simple BoW features, which yields reasonable performance.
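For reference, a minimal sketch of a BoW + logistic regression baseline like the one described, assuming scikit-learn (the exact feature settings used in the original experiment are not specified; the docs and labels here are toy stand-ins):

```python
# Toy BoW + logistic regression baseline (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["long document one ...", "another long document ...",
        "short text example", "more text here"]
labels = [0, 1, 0, 1]

clf = make_pipeline(
    CountVectorizer(max_features=50_000),                       # simple word counts
    LogisticRegression(class_weight="balanced", max_iter=1000),  # reweight for 80/20 imbalance
)
clf.fit(docs, labels)
preds = clf.predict(docs)
```

`class_weight="balanced"` is one easy way to make the linear baseline aware of the imbalance.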

I am now testing multilingual BERT with two linear layers on top and a cross-entropy loss; however, its performance is quite low. The “annoying” part is that, on a given test set, BERT always predicts the majority class. It is worth noting that the dataset (both train and test) is rather imbalanced (80/20).
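For context, a rough sketch of the setup described above: an mBERT encoder with a two-layer head trained with cross-entropy. The hidden size of the intermediate layer and the head layout are assumptions, since the post does not give exact dimensions:

```python
# Sketch of an mBERT classifier with a two-linear-layer head.
import torch.nn as nn
from transformers import AutoModel

class BertClassifier(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased", num_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Two linear layers on top of the encoder, as in the post.
        self.head = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Classify from the [CLS] token representation.
        return self.head(out.last_hidden_state[:, 0])

# loss_fn = nn.CrossEntropyLoss(weight=class_weights)  # optional class weighting
```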

I have tried the following without any luck:

a) Play around with the learning rate, class weighting, number of linear layers & associated configurations.
b) Select different parts of the document as input to BERT.
c) Generate balanced samples (incl. oversampling the minority class).
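For (c), one standard way to oversample the minority class in PyTorch is a `WeightedRandomSampler`; a small sketch with toy labels mirroring the 80/20 split in the post:

```python
# Minority-class oversampling via PyTorch's WeightedRandomSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0] * 80 + [1] * 20)    # toy 80/20 imbalance
class_counts = torch.bincount(labels).float() # tensor([80., 20.])
weights = 1.0 / class_counts[labels]          # rarer class -> higher sampling weight

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
dataset = TensorDataset(torch.arange(len(labels)), labels)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
# Each epoch now draws roughly balanced batches in expectation.
```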

I have also tried generating a synthetic toy dataset: 1K examples derived from one document belonging to one class and another 1K examples from one document belonging to the other class. There, performance was perfect, as expected.

Is there something obvious that I am missing in terms of debugging my model? Is the problem the imbalanced nature of the dataset I am working with? Could a Focal loss (or anything else) help on this end?
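In case it helps anyone reading, a minimal focal loss sketch (following Lin et al., 2017) on top of cross-entropy; with `gamma=0` it reduces to plain cross-entropy, and larger `gamma` down-weights easy (typically majority-class) examples:

```python
# Minimal focal loss for multi-class logits, built on cross-entropy.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, weight=None):
    ce = F.cross_entropy(logits, targets, weight=weight, reduction="none")
    pt = torch.exp(-ce)                      # model probability of the true class
    return ((1 - pt) ** gamma * ce).mean()   # down-weight easy examples

logits = torch.tensor([[2.0, -1.0], [0.5, 0.3]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)
```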

Hi @aguarius, my naive guess is that the length of your documents is the source of the low performance since BERT has a maximum context size of 512 tokens which is only a handful of paragraphs.

One somewhat hacky approach to this could be to chunk your document into smaller passages, extract the hidden states per passage and then average them as features for your linear layers.
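A sketch of that chunk-and-average idea, assuming the `transformers` library (the model name and non-overlapping chunking are illustrative choices; in practice you would load the tokenizer/model once and cache them rather than per call):

```python
# Chunk a long document into 512-token windows, encode each with BERT,
# and average the per-chunk [CLS] states into one document feature vector.
import torch
from transformers import AutoModel, AutoTokenizer

def document_features(text, model_name="bert-base-multilingual-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    enc = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_overflowing_tokens=True,  # one row per 512-token chunk
        padding=True,
        return_tensors="pt",
    )
    enc.pop("overflow_to_sample_mapping", None)  # not a model input
    with torch.no_grad():
        out = model(**enc)
    cls_states = out.last_hidden_state[:, 0]     # [num_chunks, hidden]
    return cls_states.mean(dim=0)                # [hidden], fed to the linear layers
```

The returned vector can then replace the single-pass `[CLS]` embedding as input to the classification head.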

What language(s) are in your corpus? That might be another source of difficulty, since mBERT is not equally strong across all of its languages; perhaps you could work with a better model like XLM-RoBERTa (or even a monolingual one, if that's possible).