I am fine-tuning Llama 2 for question answering, and I have built my own dataset for training it. I fine-tuned it once before, but without a compute_metrics function. Which metrics should I use, and what options are available for evaluating QA?
PS: The model is for health care, so which metrics would be suitable for evaluating it in that domain?
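For reference, below is a minimal sketch of the two standard extractive-QA metrics, SQuAD-style exact match (EM) and token-level F1. It assumes the model's predictions and the reference answers have already been decoded to plain strings. The helper names (`normalize_answer`, `exact_match`, `token_f1`, `compute_qa_metrics`) are hypothetical and do not come from any library. Plugging this into a Hugging Face `Trainer` as `compute_metrics` would still require decoding the generated token IDs into text first.

```python
import re
import string
from collections import Counter

# Translation table that deletes all ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most min(count_pred, count_ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def compute_qa_metrics(predictions, references):
    """Hypothetical helper: average EM and F1 (scaled to 0-100) over decoded answer strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("need at least one prediction")
    n = len(predictions)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references))
    f1 = sum(token_f1(p, r) for p, r in zip(predictions, references))
    return {"exact_match": 100.0 * em / n, "f1": 100.0 * f1 / n}


if __name__ == "__main__":
    preds = ["The aspirin", "ibuprofen 200 mg"]
    refs = ["aspirin", "ibuprofen 400 mg"]
    print(compute_qa_metrics(preds, refs))
```

In the usage example, the wrong dosage ("200 mg" vs "400 mg") gets EM 0 but still earns about 0.67 token F1. This suggests that overlap metrics alone may be too lenient for health-care answers.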