More insight into benchmarks/leaderboard -- individual task performance by model

I understand that when you submit a model to the leaderboard, it’s benchmarked against several sets of tasks, and the leaderboard links to those sets so you can see what each one contains. But the results released per model are highly aggregated (an overall score on each set of tasks). What I’m hoping to find or create is per-task performance: for a given set of models, I want to see actual results on each individual task.

My first thought was to use the Inference API: feed in each task I’m interested in and iterate over each model of interest to collect the results. This works, but since the Inference API only serves the smaller models, those are the only models I can test this way.
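Roughly what I’m doing now (a minimal sketch -- the token, model names, and prompts below are just placeholders; in practice I pull the prompts from the leaderboard’s linked task sets):

```python
import requests

HF_TOKEN = "hf_..."  # placeholder for my personal access token

# Placeholder lists -- substitute the models and task prompts of interest.
models = ["tiiuae/falcon-7b", "mistralai/Mistral-7B-v0.1"]
tasks = ["What is the capital of France?", "2 + 2 ="]

results = {}
for model_id in models:
    for prompt in tasks:
        resp = requests.post(
            f"https://api-inference.huggingface.co/models/{model_id}",
            headers={"Authorization": f"Bearer {HF_TOKEN}"},
            json={"inputs": prompt},
        )
        resp.raise_for_status()
        # Text-generation models return a list of generations; other
        # pipelines return differently shaped JSON, so this may need adjusting.
        results[(model_id, prompt)] = resp.json()[0]["generated_text"]

for (model_id, prompt), output in results.items():
    print(model_id, "|", prompt, "->", output)
```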

Is there another solution that doesn’t involve spinning up each model of interest myself (either on my own AWS resources or by paying HuggingFace to host it)? I’m trying to better understand and describe the differences between, say, a 7B and a 40B model beyond “this one benchmarks at 50 on this assessment and that one at 60.”

Thank you.