Evaluate fine-tuned LLM for question answering

How can I evaluate the output of a fine-tuned LLM for question answering on a test set? The reference answers in the test set vary in length, so what value should I set for the maximum generation length? If I set it too high, the output comes out much longer than the reference; if I set it too low, the output is shorter than the reference or stops abruptly. I would greatly appreciate input on this. Also, would BLEU be a suitable metric?
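For reference, one common pattern is to cap only the generated portion with `max_new_tokens` (generation still ends early at the EOS token) and then score the decoded answers against the references with a metric library. Below is a minimal sketch using `transformers` and `evaluate`; the model name, questions, and references are placeholders, and BLEU/ROUGE are shown only as examples of how scoring would be wired up:

```python
# Minimal sketch (placeholder model name and test data) of generating answers
# with a token cap and scoring them against reference answers.
import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_answer(question: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(question, return_tensors="pt")
    # max_new_tokens caps only the generated part, independent of prompt length;
    # generation still stops earlier if the EOS token is produced.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens so only the generated answer is decoded.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

questions = ["What is the capital of France?"]           # placeholder test set
references = ["The capital of France is Paris."]
predictions = [generate_answer(q) for q in questions]

# BLEU penalizes short outputs heavily; ROUGE-L or BERTScore are often a
# better fit for free-form QA, and exact match / F1 for extractive QA.
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```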


It might be a good idea to run your LLM against the benchmarks used in the leaderboard.

This will let you understand its performance relative to other models.
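If you go that route, those benchmarks can typically be run locally with EleutherAI's lm-evaluation-harness (the backend behind the Open LLM Leaderboard). A rough sketch, assuming the `lm_eval.simple_evaluate` Python API and placeholder model/task names:

```python
# Rough sketch, assuming the lm-evaluation-harness Python API (pip install lm-eval).
# The model name is a placeholder; task names are examples from the harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=your-org/your-finetuned-model",  # placeholder
    tasks=["arc_challenge", "hellaswag"],                    # example tasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metric scores
```

Comparing these scores against the published leaderboard numbers gives you a sense of where your fine-tuned model stands, independent of the max-length issue in free-form generation.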