Handling Extreme Class Imbalance for Multi-Class Classification

I had a task of fine-tuning a model for sequence classification, so that the model could route each ticket to the appropriate target class. I have 106 output classes and a highly imbalanced dataset: around 23k records for some classes and as few as 2 records for others. I tried different models (distilbert-base-uncased, bert-base, deberta, roberta, bigbird) with various hyperparameter combinations and different loss functions (focal loss, weighted loss, etc.), but I am not able to break the 84% accuracy mark.

For handling class imbalance, I built a synthetic data pipeline that paraphrases my inputs to generate new samples. But as you can see, some of my classes have only 2, 10, or 100 training samples, and generating new samples from so few seeds causes severe overfitting, as there is not enough variety to paraphrase from. Oversampling and undersampling also do not work here for the same reason. Any help in this scenario would be greatly appreciated.
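For reference, the weighted focal loss I tried was roughly along these lines; this is a minimal sketch, and the `gamma` value and class-weight scheme shown are illustrative, not my exact configuration:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class FocalLossTrainer(Trainer):
    """Trainer that swaps cross-entropy for a class-weighted focal loss."""

    def __init__(self, *args, class_weights=None, gamma=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # tensor of shape (num_labels,)
        self.gamma = gamma

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Per-sample cross-entropy, no reduction yet.
        ce = F.cross_entropy(outputs.logits, labels, reduction="none")
        pt = torch.exp(-ce)  # model's probability for the true class
        # Focal term down-weights easy, high-confidence examples.
        focal = (1.0 - pt) ** self.gamma * ce
        if self.class_weights is not None:
            # Scale each sample by its class weight (e.g. inverse frequency).
            focal = focal * self.class_weights.to(focal.device)[labels]
        loss = focal.mean()
        return (loss, outputs) if return_outputs else loss
```

I used this as a drop-in replacement for `Trainer`, with `class_weights` computed from inverse class frequencies, but it did not move accuracy past the mark above.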


It may be difficult to do this with a single embedding-based classifier alone. You may have to divide the problem into several stages, or use a larger model. A rough sketch of the staged idea follows after the links below.
https://datascience.stackexchange.com/questions/71558/text-classification-into-thousands-of-classes

https://www.mdpi.com/2079-9292/13/7/1199
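For illustration, here is a minimal sketch of what "several stages" could look like, assuming the 106 classes can first be grouped into a handful of coarse buckets; all checkpoint paths and bucket names below are hypothetical:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoints: one coarse router,
# plus one fine-grained model per coarse bucket.
coarse = pipeline("text-classification", model="./coarse-router")
fine_models = {
    "billing": pipeline("text-classification", model="./fine-billing"),
    "technical": pipeline("text-classification", model="./fine-technical"),
    # ... one model (or shared head) per remaining bucket
}

def route(ticket_text: str) -> str:
    """Stage 1 picks a coarse bucket; stage 2 picks the final class within it."""
    bucket = coarse(ticket_text)[0]["label"]
    return fine_models[bucket](ticket_text)[0]["label"]
```

Each fine-grained model then only has to separate a few dozen classes, so the rare classes compete with far fewer neighbors and the per-bucket class distributions are less skewed.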