Benchmarking LLMs

Hi, I am new to decoder-only models (often referred to as LLMs). Can someone point me to a code example of the correct way to benchmark an LLM (that is on the Hugging Face Hub) on a dataset (also on the Hub)? Something like how the Open LLM Leaderboard runs its evaluations. The classical way of batching with a data collator (which works for encoder-decoder models) runs into problems because of batching with padding. Since decoder-only models are causal models, it makes no sense to add pads, right? I tested with Llama 2 and Mistral, and if I add padding before the answer it changes the answer. So for now I am running with a test batch size of 1.
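
To make the problem concrete, this is roughly the kind of check where I see the answer change (the model name is just an example, and the exact pad handling may differ per tokenizer):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'mistralai/Mistral-7B-v0.1'  # example; I see the same with Llama 2
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
model.eval()

prompt = 'Q: What is the capital of France?\nA:'

# No padding: generate straight from the prompt.
plain = tokenizer(prompt, return_tensors='pt').to(model.device)
out_plain = model.generate(**plain, max_new_tokens=5, do_sample=False)

# Pads inserted after the prompt (i.e. "before the answer"), as a right-padding
# data collator would do: the continuation usually changes.
tokenizer.padding_side = 'right'
padded = tokenizer(prompt, return_tensors='pt', padding='max_length',
                   max_length=plain['input_ids'].shape[1] + 8).to(model.device)
out_padded = model.generate(**padded, max_new_tokens=5, do_sample=False)

print(tokenizer.decode(out_plain[0], skip_special_tokens=True))
print(tokenizer.decode(out_padded[0], skip_special_tokens=True))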

AI generated:
Benchmarking a decoder-only model on a dataset from the Hugging Face Hub can be a bit tricky due to the causal nature of these models. I'd be happy to guide you through a correct way to do it.

One popular approach is to use EleutherAI's lm-evaluation-harness (the lm-eval package), which is what the Open LLM Leaderboard itself is built on. Alternatively, you can write your own evaluation loop with transformers, datasets, and the evaluate library.
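
If your goal is to reproduce leaderboard-style numbers, the harness is usually the simplest route. Here is a minimal sketch, assuming a recent (0.4.x) lm-eval release; the task names and model id are just examples, and the exact arguments are documented in the harness README:

# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model='hf',                                          # the Hugging Face transformers backend
    model_args='pretrained=mistralai/Mistral-7B-v0.1',   # any causal LM id on the Hub
    tasks=['hellaswag', 'arc_easy'],                     # tasks shipped with the harness
    num_fewshot=0,
    batch_size=8,
)
print(results['results'])

The harness takes care of tokenization, batching, and padding for causal models, which is exactly the part that is easy to get wrong by hand.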

If you prefer to roll your own loop, here's a minimal example of benchmarking a decoder-only model (e.g., Llama 2 or Mistral) on a multiple-choice dataset from the Hugging Face Hub by scoring each candidate answer with the model:

# pip install torch transformers datasets evaluate
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset_name = 'your_dataset_name'  # a multiple-choice benchmark on the Hub
dataset = load_dataset(dataset_name, split='test')  # use the benchmark's evaluation split

model_name = 'mistralai/Mistral-7B-v0.1'  # or e.g. 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
model.eval()

metric = evaluate.load('accuracy')

# One example at a time (effectively batch size 1, as you mentioned), so no
# padding is involved and the causal model sees exactly prompt + candidate answer.
# Each candidate answer is scored by its log-likelihood and the best one wins.
@torch.no_grad()
def score_choice(prompt, choice):
    prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
    full_ids = tokenizer(prompt + ' ' + choice, return_tensors='pt').input_ids.to(model.device)
    answer_len = full_ids.shape[1] - prompt_ids.shape[1]
    logits = model(full_ids).logits
    # logits at position i predict token i + 1, hence the shift by one
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_log_probs = log_probs[-answer_len:].gather(1, full_ids[0, -answer_len:].unsqueeze(-1))
    return answer_log_probs.sum().item()

# Benchmarking loop; adjust the column names ('question', 'choices', 'label')
# to whatever your dataset uses
predictions, references = [], []
for example in dataset:
    scores = [score_choice(example['question'], choice) for choice in example['choices']]
    predictions.append(int(torch.tensor(scores).argmax()))
    references.append(example['label'])

print(metric.compute(predictions=predictions, references=references))

Another approach, if you want a batch size larger than 1, is to pad on the left instead of the right: set the tokenizer's padding side to "left" and pass the attention mask to generate(), so the pad tokens never sit between the prompt and the generated answer. Dedicated tools like lm-evaluation-harness handle this for you, which is why they are usually the easiest way to reproduce leaderboard-style numbers; a sketch of doing it by hand follows below.
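
Here is a minimal sketch of batched generation with left padding, assuming a generation-style benchmark; the model name, prompts, and max_new_tokens are placeholders to adapt to your dataset:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'mistralai/Mistral-7B-v0.1'  # or any other causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs have no pad token

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
model.eval()

prompts = ['Q: What is 2 + 2?\nA:', 'Q: What is the capital of France?\nA:']

# With left padding, the pads sit before the prompt, so every sequence still
# ends with its own last prompt token and generation continues from there.
inputs = tokenizer(prompts, return_tensors='pt', padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Strip the prompt part and keep only the newly generated tokens.
generated = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

Whether greedy generation or log-likelihood scoring is the right comparison depends on how your benchmark defines its metric.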