Hello,
I explored the Open LLM Leaderboard and see that it is organized model-first. If you have a model in mind, you can look up its scores on various benchmarks to evaluate that model.
Is this same information organized benchmark-first or task-first elsewhere?
For example, I need a model that provides natural chat and can also calculate numbers accurately. I don’t have a particular model in mind. I know the task I need accomplished, but I don’t know if there is a model that does it well.
The leaderboard’s raw data could be reformatted and combined with other data to accomplish this, but I thought there might already be another Space or tool that does it.
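To illustrate the kind of reformatting I mean, here is a minimal sketch with pandas, assuming the leaderboard rows were loaded into a DataFrame. The benchmark column names (`IFEval`, `MATH`) and model names are hypothetical placeholders, not the leaderboard's actual schema:

```python
import pandas as pd

# Hypothetical sample of leaderboard rows; real column names may differ.
rows = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "IFEval": [72.1, 65.4, 80.3],   # stand-in for a chat/instruction benchmark
    "MATH":   [30.5, 45.2, 28.7],   # stand-in for a math benchmark
})

# Benchmark-first view: rank models by a combined score on the two
# benchmarks that matter for my task (equal weights here).
rows["task_score"] = rows[["IFEval", "MATH"]].mean(axis=1)
ranked = rows.sort_values("task_score", ascending=False)
print(ranked[["model", "task_score"]].to_string(index=False))
```

Something like this is easy enough to hack together, but I'd rather reuse an existing Space if one exists.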
Thank you!