Hi!
I’ve developed ModelClash, an open-source framework for LLM evaluation that offers some potential advantages over static benchmarks:
- Automatic challenge generation, reducing manual effort
- Should scale with advancing model capabilities
- Evaluates both problem creation and solving skills
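To picture the kind of duel loop these points describe, here is a minimal hypothetical sketch (toy stand-ins, not ModelClash’s actual API): one model invents challenges with a hidden checker, another tries to solve them, and both creation and solving are scored.

```python
import random

def duel(creator, solver, rounds=3):
    """Score a creator model against a solver model.

    `creator` and `solver` are hypothetical stand-ins for real model calls:
    `creator()` returns a (challenge, checker) pair, and `solver(challenge)`
    returns an answer.
    """
    creator_score = solver_score = 0
    for _ in range(rounds):
        challenge, checker = creator()
        answer = solver(challenge)
        if checker(answer):
            solver_score += 1   # solver cracked the challenge
        else:
            creator_score += 1  # creator stumped the solver
    return creator_score, solver_score

# Toy example: arithmetic challenges and a solver that always succeeds.
def toy_creator():
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"{a}+{b}", (lambda ans, total=a + b: ans == total)

toy_solver = lambda challenge: eval(challenge)
```

Because the challenge set is generated fresh each run, a loop like this can in principle keep pace with model improvements instead of saturating like a fixed benchmark.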
The project is in its early stages, but initial tests with GPT and Claude models show promising results.
I would be very happy to hear your honest thoughts on this. Also, I’m new to Hugging Face, so if you know of a better place here to share this, please let me know.