I’m fine-tuning a Qwen model with GRPO (Group Relative Policy Optimization), and it performs well — it answers nearly all the GSM8K test questions correctly.
But traditional benchmarking scripts won’t work here: they just scan the output for a bare numeric value, while my model produces verbose, step-by-step reasoning. The script can’t locate the answer inside all that text, so it marks correct answers as wrong.
Is there a legitimate way to measure, evaluate, and benchmark reasoning models like this?