More insight into benchmarks/leaderboard -- individual task performance by model

I understand that when you submit a model to the leaderboard, it’s benchmarked against several sets of tasks, and the leaderboard links to those sets so you can see what each one contains. But the results released per model are highly aggregated (an overall score on each set of tasks). What I’m hoping to find or create is per-task performance: for a given set of models, I want to see actual results on each individual task.

My first thought was to use the Inference API: feed in each task I’m interested in and iterate over each model of interest to collect the results. This works, but since the Inference API only serves the smaller models, those are the only models I can test this way.
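Roughly what I’m doing now (a minimal sketch -- the token, model names, and prompts below are just placeholders; in practice I pull the prompts from the leaderboard’s linked task sets):

```python
import requests

HF_TOKEN = "hf_..."  # placeholder for my personal access token

# Placeholder lists -- substitute the models and task prompts of interest.
models = ["tiiuae/falcon-7b", "mistralai/Mistral-7B-v0.1"]
tasks = ["What is the capital of France?", "2 + 2 ="]

results = {}
for model_id in models:
    for prompt in tasks:
        resp = requests.post(
            f"https://api-inference.huggingface.co/models/{model_id}",
            headers={"Authorization": f"Bearer {HF_TOKEN}"},
            json={"inputs": prompt},
        )
        resp.raise_for_status()
        # Text-generation models return a list of generations; other
        # pipelines return differently shaped JSON, so this may need adjusting.
        results[(model_id, prompt)] = resp.json()[0]["generated_text"]

for (model_id, prompt), output in results.items():
    print(model_id, "|", prompt, "->", output)
```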

Is there another solution that doesn’t involve spinning up each model of interest myself (either on my own AWS resources or by paying HuggingFace to host it)? I’m trying to better understand and describe the differences between, say, a 7B and a 40B model beyond “this one benchmarks at 50 on this assessment and that one at 60.”

Thank you.