Cannot replicate leaderboard MATH scores

Can somebody point me to the exact GitHub commit of lm-eval and the command-line options used for MATH? Here is my attempt with the latest commit of lm-evaluation-harness, and I cannot get from these numbers to the number posted on the leaderboard.

$ lm_eval --model hf --model_args pretrained=nvidia/AceMath-1.5B-Instruct --tasks leaderboard_math_hard --device cuda:0 --batch_size auto --trust_remote_code --log_samples --output_path foo-results
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|  N|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|--:|
|leaderboard_math_hard                        |    N/A|      |      |           |   |      |   |      |   |
| - leaderboard_math_algebra_hard             |      2|none  |     4|exact_match|↑  |0.5407|±  |0.0285|307|
| - leaderboard_math_counting_and_prob_hard   |      2|none  |     4|exact_match|↑  |0.3008|±  |0.0415|123|
| - leaderboard_math_geometry_hard            |      2|none  |     4|exact_match|↑  |0.1212|±  |0.0285|132|
| - leaderboard_math_intermediate_algebra_hard|      2|none  |     4|exact_match|↑  |0.1000|±  |0.0180|280|
| - leaderboard_math_num_theory_hard          |      2|none  |     4|exact_match|↑  |0.2078|±  |0.0328|154|
| - leaderboard_math_prealgebra_hard          |      2|none  |     4|exact_match|↑  |0.5026|±  |0.0361|193|
| - leaderboard_math_precalculus_hard         |      2|none  |     4|exact_match|↑  |0.1185|±  |0.0279|135|
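For reference, here is a minimal sketch of the two obvious ways to aggregate these subtask scores into a single MATH number, assuming the final count on each row is that subtask's sample size. I am not certain which aggregation the leaderboard uses, and the v2 leaderboard may also apply score normalization on top of the raw average:

```python
# Aggregate per-subtask exact_match scores two ways: a macro average
# (plain mean over subtasks) and a micro average (weighted by sample count).
# Scores and sample counts are copied from the lm-eval table above.
scores = {
    # subtask: (exact_match, n_samples)
    "algebra":              (0.5407, 307),
    "counting_and_prob":    (0.3008, 123),
    "geometry":             (0.1212, 132),
    "intermediate_algebra": (0.1000, 280),
    "num_theory":           (0.2078, 154),
    "prealgebra":           (0.5026, 193),
    "precalculus":          (0.1185, 135),
}

# Macro average: unweighted mean of the subtask scores.
macro = sum(em for em, _ in scores.values()) / len(scores)

# Micro average: each subtask weighted by its number of samples.
total_n = sum(n for _, n in scores.values())
micro = sum(em * n for em, n in scores.values()) / total_n

print(f"macro exact_match: {macro:.4f}")  # 0.2702
print(f"micro exact_match: {micro:.4f}")  # 0.2961
```

Neither of these is guaranteed to match the leaderboard's aggregation, but they bracket what a simple average of the table above can produce.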


Oh… perhaps it was a glitch that has since been fixed?