Why can't I reproduce benchmark scores from papers like Phi, Llama, or Qwen? Am I doing something wrong or is this normal?

I’m working on evaluating open-source LLMs (e.g., Phi, Llama, Qwen), and I’ve noticed that the benchmark scores I get are consistently different from the ones reported in their tech reports or papers — sometimes by a wide margin.

Sometimes the results are lower than expected, and surprisingly, sometimes they’re higher. The point is that in many cases the difference is quite large, and it’s not clear why.

I’ve tried:

  • Using lm-eval-harness with the default settings
  • Matching tokenizers and prompt formats as best as possible
  • Evaluating on the standard benchmarks reported in the papers (MMLU, GSM8K, ARC, etc.) under the same few-shot conditions (a typical invocation is sketched after this list)
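
For concreteness, here is a minimal sketch of the kind of run I mean, using the lm-eval Python API. The model checkpoint, few-shot count, and batch size are just placeholders for whatever the paper reports, not the exact values I used for every model:

    # Minimal sketch of an lm-evaluation-harness (lm-eval 0.4.x) run.
    # The checkpoint, tasks, few-shot count, and batch size are placeholders;
    # note that a single num_fewshot applies to all listed tasks, whereas
    # papers often use a different shot count per benchmark.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Qwen/Qwen2.5-7B,dtype=bfloat16",
        tasks=["mmlu", "gsm8k", "arc_challenge"],
        num_fewshot=5,
        batch_size=8,
    )
    print(results["results"])  # per-task metric dictionaries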

Despite this, the scores I get are often significantly different from what’s published — and I can’t find any official scripts or clear explanations of the exact benchmarking setup used in those papers.

This seems to happen not just with one model, but across many open-source models.

Is this a common experience in the community?

  • Are papers using special prompt engineering or internal eval setups they don’t release?
  • Am I missing some key benchmarking tricks?
  • Is this just part of the game at this point?

Would really appreciate if anyone can share:

  • Experience trying to reproduce scores
  • Any evaluation tips
  • Benchmarking setups that actually match reported numbers

Thanks in advance!


If the backend differs, or more precisely if the library version or the options passed to the generation function (such as temperature) differ, the results can vary, so published scores can really only be used as a rough guide. Leaderboards are easier to compare because they apply the same criteria to every model within the same leaderboard, but there aren’t many absolute indicators you can rely on. For the large companies, the output of the endpoints they officially provide can serve as a reference point.
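
To illustrate that point, here is a minimal sketch of pinning the generation settings yourself with transformers so that two runs are at least decoding the same way. The model id and prompt are just examples, not anything from this thread, and library or backend version differences can of course still shift scores:

    # Sketch: fix the decoding settings so repeated runs are comparable.
    # With do_sample=False (greedy decoding), sampling knobs such as
    # temperature and top_p no longer introduce run-to-run variation.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-7B"  # example checkpoint, not one from the thread
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tok("Q: What is 17 * 23?\nA:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))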

Yes, this is normal. Benchmark results in LLM papers are often not exactly reproducible due to differences in evaluation code, tokenizer versions, prompt formatting, dataset splits, hardware, random seeds, and sometimes undocumented “internal” settings or prompt engineering. Even “official” scripts can produce different results if any dependency changes.

Tips:

  • Use the exact tokenizer and prompt format as the original paper or repo.
  • Check for hidden preprocessing, special instructions, or test-set modifications.
  • Use the same version of lm-eval-harness and its dependencies.
  • Run multiple seeds and average the results if possible (see the sketch after this list).
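
To make the last tip concrete, here is a minimal sketch of averaging one task over a few seeded runs. The model id, task, seed arguments, and metric key are assumptions that depend on your harness version, so inspect results["results"] once to confirm the key names:

    # Sketch: run the same evaluation several times and average the scores.
    # Assumes a recent lm-evaluation-harness (0.4.x); the metric key
    # "acc,none" is an example and varies by task and version.
    import statistics
    import lm_eval

    scores = []
    for seed in (0, 1, 2):
        # Recent harness versions expose explicit seed arguments; if yours
        # does not, drop them and seed random/numpy/torch globally instead.
        results = lm_eval.simple_evaluate(
            model="hf",
            model_args="pretrained=Qwen/Qwen2.5-7B",
            tasks=["arc_challenge"],
            num_fewshot=25,
            random_seed=seed,
            fewshot_random_seed=seed,
        )
        scores.append(results["results"]["arc_challenge"]["acc,none"])

    print("mean:", statistics.mean(scores), "stdev:", statistics.pstdev(scores))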

Even then, you may still see some differences from the published numbers. This is a common and well-known issue in the LLM community.

