Hi, I am new to decoder-only models (often referred to as LLMs). Can someone point me to a code example of the correct way to benchmark an LLM (that is on the Hugging Face Hub) on a dataset (also on the Hub)? Something like how the Open LLM Leaderboard is run. The classical way of batching with a data collator (which works on encoder-decoder models) runs into problems because of batching with padding. Since decoder-only models are causal models, it makes no sense to add pads, right? I tested with Llama 2 and Mistral, and if I add pads before the answer it changes the answer. So for now I am running with a test batch size of 1.
AI generated:
Benchmarking a decoder-only model on a dataset from the Hugging Face Hub can be a bit tricky due to the causal nature of these models. I’d be happy to guide you through a correct way to do it.
One popular approach is EleutherAI’s lm-evaluation-harness, which is what the Open LLM Leaderboard uses under the hood. Alternatively, you can write a small evaluation loop yourself with the transformers and datasets libraries.
Here’s a straightforward example of the do-it-yourself route for a decoder-only model (e.g., Llama 2 or Mistral) on a dataset from the Hub. Treat it as a sketch: the column names ('text', 'label') and the answer-letter mapping are placeholders you will need to adapt to your task:
# Load the dataset and model from the Hub before running the benchmark.
# NOTE: the column names ('text', 'label') and the A/B/C/D answer mapping
# below are assumptions -- adapt them to the dataset you are evaluating.
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset_name = 'your_dataset_name'
dataset = load_dataset(dataset_name)

model_name = 'your_model_name'  # e.g. 'meta-llama/Llama-2-7b-hf' or 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'left'  # left padding keeps the last position a real token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
model.eval()

batch_size = 1  # As you mentioned, this is a good starting point for causal models

# Benchmarking loop: predict the answer letter from the last-token logits
metric = evaluate.load('accuracy')
choice_ids = [tokenizer(f' {c}', add_special_tokens=False).input_ids[-1] for c in 'ABCD']

for batch in dataset['test'].iter(batch_size=batch_size):  # evaluate on the test split
    inputs = tokenizer(batch['text'], return_tensors='pt', padding=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    last_logits = outputs.logits[:, -1, choice_ids]  # logits of the answer-letter tokens
    predictions = torch.argmax(last_logits, dim=-1)
    metric.add_batch(predictions=predictions.cpu().tolist(), references=batch['label'])

print(metric.compute())
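On the padding problem itself: pads are fine for decoder-only models as long as you pad on the left and pass the attention mask, so no pad tokens end up between the prompt and the answer. With right padding the model keeps generating after the pads, which is why the answers changed in your Llama 2 and Mistral tests. A minimal sketch for batched generation (the model id and prompts are just placeholders):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'mistralai/Mistral-7B-v0.1'  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'left'  # pads go before the prompt, not between prompt and answer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
model.eval()

prompts = ['The capital of France is', 'Question: 2 + 2 = ? Answer:']  # toy prompts of different lengths
inputs = tokenizer(prompts, return_tensors='pt', padding=True).to(model.device)
with torch.no_grad():
    # The attention mask masks out the left pads, so batched greedy decoding
    # gives the same answers (up to numerical noise) as running prompts one at a time.
    generated = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                               pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))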
Another approach, as mentioned above, is to run the lm-evaluation-harness directly. It takes care of tokenization, batching, and padding for you, and it scores multiple-choice tasks by comparing the log-likelihood the model assigns to each answer choice, so you never have to worry about pads landing in front of the answer.
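The exact interface depends on which harness version you install, so treat the following as a sketch rather than the definitive call (the task name and checkpoint are placeholders); recent releases expose a simple_evaluate entry point:
# pip install lm-eval
# Sketch of the lm-evaluation-harness Python API; argument names and the
# results layout can differ between versions, so check your installed docs.
import lm_eval

results = lm_eval.simple_evaluate(
    model='hf',  # the Hugging Face transformers backend
    model_args='pretrained=mistralai/Mistral-7B-v0.1,dtype=float16',  # placeholder checkpoint
    tasks=['hellaswag'],  # any task the harness defines
    batch_size=8,
)
print(results['results'])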