Why is my DistilBERT model performing poorly on some classes despite hyperparameter tuning?

I am working on an emotion classification task using DistilBERT, with data collected from multiple sources. My dataset is balanced across all emotion categories, so class imbalance should not be a major issue.

However, after trying multiple hyperparameter settings, the model consistently performs poorly overall (accuracy stuck around 48%) and only predicts certain categories well while failing on others.
What I have tried so far:

  1. Learning rates from 1e-06 to 5e-05
  2. Batch sizes: 16, 32, 64
  3. Weight decay: 0.1, 0.01, 0.03
  4. Optimizer: Adam
  5. Scheduler type: cosine, linear
  6. Epochs: 2, 4, 5, 8, 10

Currently, the best performance is 48%, and the classification report is as follows:
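For context, my fine-tuning setup looks roughly like this (a simplified sketch; `train_dataset` and `eval_dataset` stand in for my tokenized splits, and the Trainer's default optimizer is AdamW):

```python
# Rough sketch of one hyperparameter combination from the sweep above.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,  # six emotion categories
)

args = TrainingArguments(
    output_dir="distilbert-emotion",
    learning_rate=5e-5,              # swept 1e-6 .. 5e-5
    per_device_train_batch_size=32,  # tried 16 / 32 / 64
    weight_decay=0.01,               # tried 0.1 / 0.01 / 0.03
    lr_scheduler_type="linear",      # also tried "cosine"
    num_train_epochs=5,              # tried 2 / 4 / 5 / 8 / 10
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized train split (placeholder)
    eval_dataset=eval_dataset,    # tokenized test split (placeholder)
)
trainer.train()
```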

Hello,
What is the size of your training set and your test set? How many samples do you have?
It seems your learning rate is low, and you may need more epochs depending on the size of your training and test sets.
Regards


Hi, thanks for your response.
I have about 9,880 training samples and 2,470 test samples.


Hi,

You mentioned that your dataset is balanced, but the model seems biased toward disgust and shame, while sadness and joy have very low recall. This could be due to ambiguous text or varied expressions that make those classes harder to learn.

Have you checked the loss curve for underfitting or overfitting? Since DistilBERT is a smaller model, it may need more than 10 epochs to generalize well. Analyzing misclassified samples might reveal patterns behind these errors (see the sketch below). Also, you could try increasing the learning rate (e.g., to 5e-4 or even 5e-3) to speed up learning and accelerate convergence, even if it sacrifices some fine-tuning precision.
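Something along these lines (a minimal sketch, assuming you use the Trainer API and that `trainer`, a tokenized `test_dataset` that keeps a raw `text` column, and an `id2label` mapping already exist in your code) would surface the misclassified samples per class:

```python
# Sketch: list a few misclassified test samples for each emotion class.
import numpy as np

output = trainer.predict(test_dataset)          # logits + true labels
preds = np.argmax(output.predictions, axis=-1)
labels = output.label_ids

for true_id in np.unique(labels):
    wrong = np.where((labels == true_id) & (preds != true_id))[0]
    print(f"\n{id2label[true_id]}: {len(wrong)} misclassified")
    for i in wrong[:5].tolist():                # inspect a handful per class
        print(f"  -> predicted {id2label[preds[i]]}: {test_dataset[i]['text'][:100]}")
```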

Hope this helps!


Yeah, I just checked the curve and found that the model is underfitting. I tried 5e-3 with 12 epochs, but it seems my epoch count is still too low and the learning rate is too high: the accuracy dropped to 16%.


I might try 5e-4 with 12 epochs first to see if that works.
Anyway, thanks in advance for your help.


Hmmm, it looks like the loss drops very fast in the first epoch and then stays flat. I guess it could indicate an issue with the data.
Do you fully trust the labels? It might be helpful to manually inspect some samples from problematic classes (e.g., anger, fear, joy) to see if there are inconsistencies or ambiguous cases.

Could you also share the confusion matrix? It might give more insight into which classes the model is confusing the most.
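If you have the predictions at hand, a couple of lines with scikit-learn will produce it (a sketch assuming `preds` and `labels` are the integer arrays from `trainer.predict` as above, and `label_names` lists your emotion names):

```python
# Sketch: confusion matrix over the test set (rows = true class, cols = predicted).
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(labels, preds)
ConfusionMatrixDisplay(cm, display_labels=label_names).plot(xticks_rotation=45)
plt.show()
```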


This is the confusion matrix from the run with 5e-3 and 12 epochs:


While trying other settings, I found that there is a bias toward the labels anger and fear (in the run with 49% accuracy).


The data for the anger and fear labels comes from the CARER dataset, and when I manually inspected it, I didn't see any problem either :thinking:


Wait, I think I might have found a reason: I sorted my dataset by category earlier, so could that be the cause of this bias?


Yes, sorting the dataset by category before splitting into train and test could definitely cause this bias. If the split wasn’t random, your model might be training only on certain classes and testing on others, which would explain the poor performance on some emotions.
Also, double-check that sorting didn’t accidentally change the alignment of texts and labels, as that could introduce incorrect labels. Try reshuffling the dataset and making sure the train-test split is random to see if performance improves.
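Something like this keeps the split both random and class-balanced (a sketch assuming your data sits in a pandas DataFrame `df` with text and label columns):

```python
# Sketch: shuffled, stratified 80/20 split so each emotion is evenly represented.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,                    # full DataFrame with text + label columns
    test_size=0.2,         # roughly 9,880 train / 2,470 test rows
    shuffle=True,
    stratify=df["label"],  # preserve class proportions in both splits
    random_state=42,
)
```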


Thank you @ddrbcn. I have tried reshuffling and a random train-test split, but the result still stays at 49%, though the confusion matrix is slightly better.


I think it is a dataset quality problem: disgust and shame might just be easier to learn compared to the other 4 categories? Anyway, I will keep training while also looking for another dataset that contains the same categories as mine.


You’re welcome! I’m glad to hear that reshuffling and a random train-test split have improved the confusion matrix, even if accuracy is still low.
You could try experimenting again with different learning rates and other hyperparameters using this new split to see if you get better results. Your idea of testing with another dataset also sounds like a good approach.

Regarding your second point, disgust and shame might be easier for the model to learn, but I find it interesting that it struggles with joy. In theory, the type of text in that category should be quite distinct from all the remaining classes. I suggest focusing on joy and checking if there might be some labeling inconsistencies or ambiguous samples in that class.


Hi @ddrbcn, I manually checked the dataset again and found that I made a mistake when extracting the rows from the original dataset, which mixed up the labels and made them inconsistent with the original data. Now, after carefully restoring the labels, the accuracy is up. So sorry for making this kind of error, and I really appreciate the effort and time you took to help me.
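In case anyone runs into the same issue: a quick sanity check along these lines (DataFrame and column names here are hypothetical) would have caught the mismatch early:

```python
# Sketch: join the extracted rows back to the source data on the text column
# and count label mismatches (extracted_df / original_df are assumed DataFrames).
merged = extracted_df.merge(original_df, on="text", suffixes=("_new", "_orig"))
mismatches = merged[merged["label_new"] != merged["label_orig"]]
print(f"{len(mismatches)} mismatched labels out of {len(merged)} merged rows")
```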


Please do not mention it! The reason I insisted on checking the labels and suggested verifying if sorting or something else had misaligned them was because I’ve made similar mistakes in the past. Those experiences taught me valuable lessons, and learning from errors is just part of the journey.

What really matters is being open to investigating issues and asking for help when needed. I’ve also received a lot of support from different tech communities over time, and that’s the beauty and the power of collective knowledge—we all grow together.

It’s been a pleasure helping you, and I’m really glad you found the issue! If everything is working now, you might want to mark the topic as solved. Best of luck with your project!


Really appreciate your support! Wishing you smooth progress and great success in all your projects too!

