How to use customized compute_metrics in trainer

I am doing SFT with GSM math data, I want to have the trainer calculate the actual final answer accuracy for me during evaluation on validation dataset. Is there a way to do that with compute_metric function?

Thanks!!

Yeah. There’s no built-in metric for that, but the Trainer takes a compute_metrics callback and many people write their own, so I think that’s the safest option.


An example by HuggingChat:

Yes, you can use the compute_metrics function in Hugging Face’s Trainer to calculate the final answer accuracy for your GSM math data during evaluation on the validation dataset. Here’s how:

  1. Define Your Metrics Function: Write a custom compute_metrics function that receives the model’s predictions and the labels (ground-truth answers). For math datasets like GSM8K, this boils down to comparing each predicted final answer to the reference answer.

  2. Extract Numerical Answers: Since the model outputs are text strings, you need to parse the numerical answer out of the generated text. For example, if the model outputs “The answer is 42,” you’ll need to extract “42” and compare it to the correct answer (see the extraction sketch after this list).

  3. Calculate Accuracy: Once you have the predicted and reference numerical answers, compute accuracy as the fraction of examples where they match exactly.

  4. Return Metrics: Return the computed metrics (e.g., accuracy) as a dictionary from compute_metrics. The Trainer will then automatically log these metrics during evaluation.
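
The extract_number helper referenced above isn’t part of transformers; here’s a minimal sketch of one, assuming the final answer is the last number in the generated text (GSM8K answers sometimes contain thousands separators, hence the comma stripping):

import re

def extract_number(text):
    # Strip thousands separators, then take the last integer or decimal
    # in the text as the final answer, e.g. "The answer is 42." -> "42"
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None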

Here’s a simplified example of what the compute_metrics function might look like (assuming the predictions and labels have already been decoded into strings):

def compute_metrics(eval_pred):
    # eval_pred is a transformers.EvalPrediction; here we assume
    # .predictions and .label_ids are already decoded answer strings
    predictions, labels = eval_pred.predictions, eval_pred.label_ids
    # Extract the numerical answer from each text
    predicted_answers = [extract_number(pred) for pred in predictions]
    reference_answers = [extract_number(label) for label in labels]
    # Accuracy = fraction of examples whose extracted answers match
    correct = sum(1 for pred, ref in zip(predicted_answers, reference_answers) if pred == ref)
    return {"accuracy": correct / len(reference_answers)}

Once you pass this function to the Trainer via its compute_metrics argument, it will calculate and log your model’s answer accuracy every time it evaluates on the validation dataset.
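
Note that with a causal-LM Trainer, the predictions handed to compute_metrics are logits (or token ids), not strings, so in practice you also need an argmax and a decode step. Below is a minimal wiring sketch, not a definitive implementation: it assumes you already have model, tokenizer, and tokenized train_dataset/eval_dataset in scope, and it uses the Trainer’s preprocess_logits_for_metrics hook so the full vocabulary-sized logit tensor doesn’t pile up in memory:

import numpy as np
from transformers import Trainer, TrainingArguments

def preprocess_logits_for_metrics(logits, labels):
    # Keep only the predicted token ids instead of the full logits
    return logits.argmax(dim=-1)

def compute_metrics(eval_pred):
    pred_ids, label_ids = eval_pred.predictions, eval_pred.label_ids
    # Positions masked with -100 are ignored by the loss; swap in the
    # pad token so they decode cleanly
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)
    pred_texts = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_texts = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    correct = sum(
        1 for p, l in zip(pred_texts, label_texts)
        if extract_number(p) is not None and extract_number(p) == extract_number(l)
    )
    return {"accuracy": correct / len(label_texts)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", eval_strategy="epoch"),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)

One caveat: this scores teacher-forced next-token predictions. For true final-answer accuracy you would generate completions (e.g. with model.generate) and extract the answer from the generated text, but the compute_metrics flow above is the standard place to plug the comparison in.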
