Reasoning LLM Benchmarking

I’m fine-tuning a Qwen model with GRPO (Group Relative Policy Optimisation). It performs well and answers nearly all of the GSM8K test questions correctly.

But for benchmarking, the traditional scripts don’t work: they just look for a numeric value, while the model’s output is verbose, step-by-step reasoning, so the benchmark marks correct answers as wrong.

How do people benchmark these models? A script that simply scans the answer for a numeric value can’t recognise a correct answer buried in verbose reasoning output.
Is there a legitimate way to measure/evaluate and benchmark reasoning models?


Hmm, for example, could we use the benchmarks for reasoning models that are used in these leaderboards?

Thanks for the suggestion. The thing is, I need to run multiple benchmarks regularly to track performance and improvement, and I also need to demonstrate a well-known method.
