I’m fine-tuning a Qwen model with GRPO (Group Relative Policy Optimization), and it performs well — it answers nearly all the GSM8K test questions correctly.
But traditional benchmarking scripts won’t work here: they just scan the output for a bare numeric value, while my model produces verbose, step-by-step reasoning. The script can’t locate the answer inside all that text, so it marks correct answers as wrong.
Is there a legitimate way to measure, evaluate, and benchmark reasoning models like this?