How can I evaluate the output of a fine-tuned LLM for question answering on a test set? The reference answers in the test set vary in length, so what value should I set for the maximum generation length? If I set it too high, the output is much longer than the reference; if I set it too low, the output is cut short or stops abruptly. I would greatly appreciate input on this. Also, would BLEU be a suitable metric?
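To make the setup concrete, here is roughly the kind of loop I am describing (a minimal sketch; the model path, test example, and the max_new_tokens value are placeholders):

```python
# Minimal sketch of the setup described above: generate answers with a fixed
# token budget and score them against the references with BLEU.
# The model path and test set below are placeholders.
import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-qa-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder test set: one prompt and one reference answer per example.
test_set = [
    {"prompt": "Q: What is the capital of France?\nA:", "reference": "Paris"},
]

predictions, references = [], []
for example in test_set:
    inputs = tokenizer(example["prompt"], return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            # Budget set above the longest reference; generation can still end
            # earlier when the model emits its end-of-sequence token.
            max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens (drop the prompt).
    answer_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    predictions.append(tokenizer.decode(answer_ids, skip_special_tokens=True))
    references.append([example["reference"]])  # BLEU allows multiple references

bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=references))
```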
It might be a good idea to run your own LLM on the benchmarks used in the leaderboard.
That will let you see how your model performs relative to others.
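For example, the leaderboard-style tasks can be run locally with EleutherAI's lm-evaluation-harness. A rough sketch, assuming the `lm_eval` package is installed (the task names here are placeholders, and the exact arguments can differ between harness versions):

```python
# Rough sketch: run leaderboard-style benchmarks locally with
# EleutherAI's lm-evaluation-harness (the lm_eval package).
# The model path and task list are placeholders, and the exact
# simple_evaluate arguments may differ between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/finetuned-qa-model",  # placeholder
    tasks=["hellaswag", "arc_easy"],  # placeholder task names
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```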