Benchmarking LLMs

Hi, I am new to decoder-only models (often referred to as LLMs). Can someone point me to a code example of the correct way to benchmark an LLM (hosted on the Hugging Face Hub) on a dataset (also on the Hub)? For example, how does the Open LLM Leaderboard run its evaluations?

The classical way of batching with a data collator (which works for encoder-decoder models) runs into problems because of the padding. Since decoder-only models are causal models, it makes no sense to add pads on the right, right? I tested with Llama 2 and Mistral, and if padding tokens end up before the answer, the generated answer changes. So for now I am running with a test batch size of 1.
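
For reference, here is a minimal sketch of the batched setup I am trying to get working, assuming that left padding plus the attention mask is the right fix. The model ID, dataset, column names, and generation settings are just placeholders for illustration:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"   # placeholder: any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"          # pad BEFORE the prompt, not after it
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama 2 / Mistral ship no pad token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Placeholder dataset/column -- swap in the benchmark you actually use.
dataset = load_dataset("gsm8k", "main", split="test")
prompts = [ex["question"] for ex in dataset.select(range(8))]

batch_size = 4
answers = []
for i in range(0, len(prompts), batch_size):
    batch = tokenizer(prompts[i:i + batch_size],
                      return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**batch,            # attention_mask hides the pads
                             max_new_tokens=64,
                             do_sample=False,
                             pad_token_id=tokenizer.pad_token_id)
    # Every row shares the same padded prompt length, so slicing is safe.
    answers += tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
print(answers[0])
```

As far as I understand, with left padding the pads sit before the prompt and `generate` derives the position IDs from the attention mask, so they shouldn't shift the prompt tokens or change the answer. Is this the right approach, or does the leaderboard handle batching differently?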