Choosing Benchmarks for Fine-Tuned Models in Emotion Analysis

Hello Hugging Face community,

I’m working on my master’s thesis, and I need your advice regarding the best way to validate my chosen models. My thesis focuses on emotion analysis in text (e.g., polarity such as positive/negative, or more fine-grained emotion categories). I’ve narrowed down my choices to five fine-tuned models from Hugging Face, but I’m struggling to select 3–4 benchmarks on which to evaluate them.

Here’s my situation:

  1. Some of the models don’t have clearly documented benchmarks.
  2. Others have benchmarks that are specific to their fine-tuning tasks, but these don’t overlap across all models.
  3. The models share base models (e.g., DistilBERT, RoBERTa), but it feels like reusing the base models’ benchmarks might not align with my goal.

My Questions:

  1. Would it make sense to evaluate the fine-tuned models on the benchmarks of their base models, or is this approach flawed for emotion analysis tasks?
  2. Should I focus on choosing a smaller set of models with entirely different base models to ensure diversity in evaluation?
  3. How would you recommend selecting 3–4 benchmarks that are suitable for comparing models fine-tuned for diverse tasks (e.g., general sentiment, social media, or domain-specific emotion analysis)?

My goal is to compare these models effectively for emotion analysis tasks while maintaining scientific rigor. Any suggestions on benchmarks or how to approach this would be greatly appreciated!
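For context, here is roughly the kind of comparison I have in mind: a minimal sketch that runs each candidate checkpoint on one shared emotion dataset with the `transformers` pipeline, `datasets`, and `evaluate`, and reports a common metric. The two checkpoints and the `dair-ai/emotion` dataset below are illustrative placeholders rather than my actual shortlist, and the label mapping between each model and the dataset is an assumption that would still need manual alignment.

```python
# Minimal sketch: evaluate several fine-tuned emotion models on one shared dataset.
# NOTE: the checkpoints and the "dair-ai/emotion" dataset are placeholders, not my
# shortlist; label schemes differ across models, so the mapping below is an assumption.
from datasets import load_dataset
from transformers import pipeline
import evaluate

candidate_models = [
    "bhadresh-savani/distilbert-base-uncased-emotion",  # placeholder checkpoint
    "j-hartmann/emotion-english-distilroberta-base",    # placeholder checkpoint
]

dataset = load_dataset("dair-ai/emotion", split="test")
label_names = dataset.features["label"].names  # e.g. ["sadness", "joy", ...]
accuracy = evaluate.load("accuracy")

for model_name in candidate_models:
    clf = pipeline("text-classification", model=model_name, truncation=True)
    preds = clf(dataset["text"], batch_size=32)
    # Map predicted label strings onto the dataset's label ids; any label the
    # dataset does not know (e.g. "neutral") becomes -1 and is counted as wrong.
    pred_ids = [
        label_names.index(p["label"].lower()) if p["label"].lower() in label_names else -1
        for p in preds
    ]
    score = accuracy.compute(predictions=pred_ids, references=dataset["label"])
    print(model_name, score)
```

Even a crude harness like this would at least make question 3 concrete for me: whichever 3–4 benchmarks I end up choosing would just need to slot into the loop above, with a per-model label mapping.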
