Evaluate fine-tuned LLM for question answering

How can I evaluate the output of a fine-tuned LLM for question answering on a test set? The reference answers in the test set vary in length, so what value should I set for the maximum generation length? If I set it too high, the output comes out much longer than the reference; if I set it too low, the output is shorter than the reference or stops abruptly. I would greatly appreciate input on this. Also, would BLEU be a suitable metric?
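For reference, one common pattern is to cap only the generated portion with `max_new_tokens` (generation still ends early at the EOS token) and then score the decoded answers against the references with a metric library. Below is a minimal sketch using `transformers` and `evaluate`; the model name, questions, and references are placeholders, and BLEU/ROUGE are shown only as examples of how scoring would be wired up:

```python
# Minimal sketch (placeholder model name and test data) of generating answers
# with a token cap and scoring them against reference answers.
import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_answer(question: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(question, return_tensors="pt")
    # max_new_tokens caps only the generated part, independent of prompt length;
    # generation still stops earlier if the EOS token is produced.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens so only the generated answer is decoded.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

questions = ["What is the capital of France?"]           # placeholder test set
references = ["The capital of France is Paris."]
predictions = [generate_answer(q) for q in questions]

# BLEU penalizes short outputs heavily; ROUGE-L or BERTScore are often a
# better fit for free-form QA, and exact match / F1 for extractive QA.
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```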


It might be a good idea to run your LLM against the benchmarks used in the leaderboard.

This will let you understand its performance relative to other models.
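If you go that route, those benchmarks can typically be run locally with EleutherAI's lm-evaluation-harness (the backend behind the Open LLM Leaderboard). A rough sketch, assuming the `lm_eval.simple_evaluate` Python API and placeholder model/task names:

```python
# Rough sketch, assuming the lm-evaluation-harness Python API (pip install lm-eval).
# The model name is a placeholder; task names are examples from the harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=your-org/your-finetuned-model",  # placeholder
    tasks=["arc_challenge", "hellaswag"],                    # example tasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metric scores
```

Comparing these scores against the published leaderboard numbers gives you a sense of where your fine-tuned model stands, independent of the max-length issue in free-form generation.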