Dealing with Imbalanced Datasets?

Hi everyone,

I am working on a binary classification task (in a non-English language) over relatively long documents (~4k words on average). As a baseline, I trained a Logistic Regression on simple BoW features, which yields reasonable performance.
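For context, the baseline is roughly the following (a sketch; `train_texts` and `train_labels` are hypothetical stand-ins for my data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Simple bag-of-words features feeding a linear classifier.
baseline = make_pipeline(
    CountVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)  # hypothetical data variables
```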

I am now testing multilingual BERT (mBERT) with two linear layers on top, trained with the Cross-Entropy loss; however, its performance is quite low. The “annoying” part is that on a given test set, BERT always predicts the majority class. It is worth noting that the dataset (both train and test) is rather imbalanced (80/20).
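Concretely, the model looks roughly like this (a sketch; the intermediate size of 256 is an arbitrary choice for illustration):

```python
import torch.nn as nn
from transformers import AutoModel

class MBertClassifier(nn.Module):
    def __init__(self, name="bert-base-multilingual-cased", mid=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        # Two linear layers on top of the [CLS] representation.
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, mid),
            nn.ReLU(),
            nn.Linear(mid, 2),  # two logits, fed to CrossEntropyLoss
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # [CLS] state
```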

I have tried the following without any luck:

a) Play around with the learning rate, class weighting, the number of linear layers, and associated configurations (class weighting is sketched after this list).
b) Select different parts of the document as input to BERT.
c) Generate balanced samples (incl. oversampling the minority class).
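To make (a) and (c) concrete, this is roughly what I mean by class weighting and oversampling (a sketch with hypothetical names `labels` and `train_dataset`; the weights simply invert the 80/20 class frequencies):

```python
import torch
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader, WeightedRandomSampler

# (a) Class weighting: scale the loss inversely to class frequency.
class_weights = torch.tensor([1 / 0.8, 1 / 0.2])
criterion = CrossEntropyLoss(weight=class_weights)

# (c) Oversampling: draw minority-class examples more often per epoch.
sample_weights = class_weights[torch.tensor(labels)]  # per-example weight
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)
```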

I have also tried generating a synthetic toy dataset of 1K examples from one document belonging to one class and another 1K examples from one document belonging to the other class; the performance was perfect, as expected.

Is there something obvious that I am missing in terms of debugging my model? Is the problem the imbalanced nature of the dataset I am working with? Could a Focal loss (or anything else) help on this end?
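For reference, this is roughly the Focal loss variant I would try (a minimal sketch of the two-class case, following Lin et al., 2017; `alpha` is an optional per-class weight analogous to the Cross-Entropy weighting above):

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Focal loss for two-class logits of shape [batch, 2]."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-example (optionally alpha-weighted) cross-entropy.
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")
    # Probability of the true class; high pt -> easy example -> down-weighted.
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - pt) ** gamma * ce).mean()
```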

Hi @aguarius, my naive guess is that the length of your documents is the source of the low performance, since BERT has a maximum context size of 512 tokens, which is only a handful of paragraphs.

One somewhat hacky approach to this could be to chunk your document into smaller passages, extract the hidden states per passage and then average them as features for your linear layers.
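A rough sketch of what I mean, assuming a fast tokenizer so that `return_overflowing_tokens` produces one row per chunk (names are illustrative):

```python
import torch

@torch.no_grad()
def document_features(text, tokenizer, model, max_len=512, stride=128):
    """Encode one long document as the mean of its per-chunk [CLS] states."""
    enc = tokenizer(text,
                    truncation=True,
                    max_length=max_len,
                    stride=stride,                   # overlap between chunks
                    return_overflowing_tokens=True,  # one row per chunk
                    padding=True,
                    return_tensors="pt")
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"])
    cls_states = out.last_hidden_state[:, 0]  # [n_chunks, hidden]
    return cls_states.mean(dim=0)             # one feature vector per doc
```

You could then train your linear layers (or even the Logistic Regression) on these fixed-size vectors.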

What language(s) are in your corpus? That might be another source of difficulty, since mBERT is not great on all of its languages; perhaps you can work with a better model like XLM-RoBERTa (or even a monolingual one, if that’s possible).
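If you try that, the swap is just a different checkpoint, e.g.:

```python
from transformers import AutoModel, AutoTokenizer

# XLM-RoBERTa drop-in via the Auto classes.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
```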