Why can't I reproduce benchmark scores from papers like Phi, Llama, or Qwen? Am I doing something wrong or is this normal?

I’m working on evaluating open-source LLMs (e.g., Phi, Llama, Qwen), and I’ve noticed that the benchmark scores I get are consistently different from the ones reported in their tech reports or papers — sometimes by a wide margin.

Sometimes the results are lower than expected, and surprisingly, sometimes they’re higher. The point is that in many cases the difference is quite large, and it’s not clear why.

I’ve tried:

  • Using lm-eval-harness with the default settings
  • Matching tokenizers and prompt formats as best as possible
  • Evaluating on the standard benchmarks reported in the papers (MMLU, GSM8K, ARC, etc.) under the same few-shot conditions (a typical invocation is sketched after this list)
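
For concreteness, here is a minimal sketch of the kind of run I mean, using the lm-eval Python API. The model checkpoint, few-shot count, and batch size are just placeholders for whatever the paper reports, not the exact values I used for every model:

    # Minimal sketch of an lm-evaluation-harness (lm-eval 0.4.x) run.
    # The checkpoint, tasks, few-shot count, and batch size are placeholders;
    # note that a single num_fewshot applies to all listed tasks, whereas
    # papers often use a different shot count per benchmark.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Qwen/Qwen2.5-7B,dtype=bfloat16",
        tasks=["mmlu", "gsm8k", "arc_challenge"],
        num_fewshot=5,
        batch_size=8,
    )
    print(results["results"])  # per-task metric dictionaries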

Despite this, the scores I get are often significantly different from what’s published — and I can’t find any official scripts or clear explanations of the exact benchmarking setup used in those papers.

This seems to happen not just with one model, but across many open-source models.

Is this a common experience in the community?

  • Are papers using special prompt engineering or internal eval setups they don’t release?
  • Am I missing some key benchmarking tricks?
  • Is this just part of the game at this point?

Would really appreciate if anyone can share:

  • Experience trying to reproduce scores
  • Any evaluation tips
  • Benchmarking setups that actually match reported numbers

Thanks in advance!


If the backend differs, or more precisely if the library version or the options passed to the generation function (such as temperature) differ, the results can vary, so published scores can really only be used as a rough guide. Leaderboards are easier to compare because they apply the same criteria to every model within the same leaderboard, but there aren’t many absolute indicators you can rely on. For the large companies, the output of the endpoints they officially provide can serve as a reference point.
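
To illustrate that point, here is a minimal sketch of pinning the generation settings yourself with transformers so that two runs are at least decoding the same way. The model id and prompt are just examples, not anything from this thread, and library or backend version differences can of course still shift scores:

    # Sketch: fix the decoding settings so repeated runs are comparable.
    # With do_sample=False (greedy decoding), sampling knobs such as
    # temperature and top_p no longer introduce run-to-run variation.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-7B"  # example checkpoint, not one from the thread
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tok("Q: What is 17 * 23?\nA:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))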

Yes, this is normal. Benchmark results in LLM papers are often not exactly reproducible due to differences in evaluation code, tokenizer versions, prompt formatting, dataset splits, hardware, random seeds, and sometimes undocumented “internal” settings or prompt engineering. Even “official” scripts can produce different results if any dependency changes.

Tips:

  • Use the exact tokenizer and prompt format as the original paper or repo.
  • Check for hidden preprocessing, special instructions, or test-set modifications.
  • Use the same version of lm-eval-harness and its dependencies.
  • Run multiple seeds and average the results if possible (see the sketch after this list).
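
To make the last tip concrete, here is a minimal sketch of averaging one task over a few seeded runs. The model id, task, seed arguments, and metric key are assumptions that depend on your harness version, so inspect results["results"] once to confirm the key names:

    # Sketch: run the same evaluation several times and average the scores.
    # Assumes a recent lm-evaluation-harness (0.4.x); the metric key
    # "acc,none" is an example and varies by task and version.
    import statistics
    import lm_eval

    scores = []
    for seed in (0, 1, 2):
        # Recent harness versions expose explicit seed arguments; if yours
        # does not, drop them and seed random/numpy/torch globally instead.
        results = lm_eval.simple_evaluate(
            model="hf",
            model_args="pretrained=Qwen/Qwen2.5-7B",
            tasks=["arc_challenge"],
            num_fewshot=25,
            random_seed=seed,
            fewshot_random_seed=seed,
        )
        scores.append(results["results"]["arc_challenge"]["acc,none"])

    print("mean:", statistics.mean(scores), "stdev:", statistics.pstdev(scores))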

Even then, you may still see some differences from the published numbers. This is a common and well-known issue in the LLM community.

