Handling Extreme Class Imbalance for Multi-Class Classification

I had a task of fine-tuning a model for sequence classification, so that the model could route each ticket to the appropriate target class. I have 106 output classes and a highly imbalanced dataset: around 23k records for some classes and as few as 2 records for others. I tried different models (distilbert-base-uncased, bert-base, deberta, roberta, bigbird) with various hyperparameter combinations and different loss functions (focal loss, weighted loss, etc.), but I am not able to break the 84% accuracy mark.

For handling class imbalance, I built a synthetic data pipeline that paraphrases my inputs to generate new samples. But as you can see, some of my classes have only 2, 10, or 100 training samples, and generating new samples from so few seeds causes severe overfitting, as there is not enough variety to paraphrase from. Oversampling and undersampling also do not work here for the same reason. Any help in this scenario would be greatly appreciated.
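For reference, the weighted focal loss I tried was roughly along these lines; this is a minimal sketch, and the `gamma` value and class-weight scheme shown are illustrative, not my exact configuration:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class FocalLossTrainer(Trainer):
    """Trainer that swaps cross-entropy for a class-weighted focal loss."""

    def __init__(self, *args, class_weights=None, gamma=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # tensor of shape (num_labels,)
        self.gamma = gamma

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Per-sample cross-entropy, no reduction yet.
        ce = F.cross_entropy(outputs.logits, labels, reduction="none")
        pt = torch.exp(-ce)  # model's probability for the true class
        # Focal term down-weights easy, high-confidence examples.
        focal = (1.0 - pt) ** self.gamma * ce
        if self.class_weights is not None:
            # Scale each sample by its class weight (e.g. inverse frequency).
            focal = focal * self.class_weights.to(focal.device)[labels]
        loss = focal.mean()
        return (loss, outputs) if return_outputs else loss
```

I used this as a drop-in replacement for `Trainer`, with `class_weights` computed from inverse class frequencies, but it did not move accuracy past the mark above.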


It may be difficult to do this with a single embedding-based classifier alone. You may have to divide the problem into several stages, or use a larger model. A rough sketch of the staged idea follows after the links below.
https://datascience.stackexchange.com/questions/71558/text-classification-into-thousands-of-classes

https://www.mdpi.com/2079-9292/13/7/1199
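For illustration, here is a minimal sketch of what "several stages" could look like, assuming the 106 classes can first be grouped into a handful of coarse buckets; all checkpoint paths and bucket names below are hypothetical:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoints: one coarse router,
# plus one fine-grained model per coarse bucket.
coarse = pipeline("text-classification", model="./coarse-router")
fine_models = {
    "billing": pipeline("text-classification", model="./fine-billing"),
    "technical": pipeline("text-classification", model="./fine-technical"),
    # ... one model (or shared head) per remaining bucket
}

def route(ticket_text: str) -> str:
    """Stage 1 picks a coarse bucket; stage 2 picks the final class within it."""
    bucket = coarse(ticket_text)[0]["label"]
    return fine_models[bucket](ticket_text)[0]["label"]
```

Each fine-grained model then only has to separate a few dozen classes, so the rare classes compete with far fewer neighbors and the per-bucket class distributions are less skewed.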