How to evaluate CLMs on MMLU?

Hi folks,

I’m trying to evaluate LLMs on the cais/mmlu dataset using a pipeline. Right now, I merge each question with its options and pass the merged Question + Options string as input to an LLM wrapped in a pipeline API. However, this workflow has two problems:

  1. Inference is extremely slow (> 60 s for a single query with a 13B model on 8x A6000 GPUs).
  2. The model outputs don’t always exactly match one of the options in the multiple-choice set.

How can I overcome these issues? Does anyone have any suggestions?
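
For reference, here is a minimal sketch of the kind of merging step described above. It assumes each row has a `question` string, a `choices` list of four strings, and an integer `answer` index, as in cais/mmlu. The helper name `format_mmlu_prompt`, the "A. / B. / C. / D." template, and the trailing "Answer:" cue are illustrative assumptions, not necessarily the exact format used in the pipeline.

```python
from typing import Sequence

LETTERS = "ABCD"  # assumed option labels; MMLU questions have four choices


def format_mmlu_prompt(question: str, choices: Sequence[str]) -> str:
    """Merge a question and its options into a single prompt string.

    Hypothetical helper: the template below is one common layout,
    not a required or official MMLU prompt format.
    """
    lines = [question.strip()]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


# Made-up record shaped like a cais/mmlu row (not taken from the dataset).
example = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,  # index into `choices`
}

prompt = format_mmlu_prompt(example["question"], example["choices"])
gold_letter = LETTERS[example["answer"]]
print(prompt)
print("Gold:", gold_letter)
```

With a template like this, the expected answer is a single letter, which is what the generated text then has to be compared against.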