Metrics for Training Set in Trainer

Hey guys,

I am currently using the Trainer in order to train my DistilBertForSequenceClassification.

My problem: I want to print/save the loss and accuracy of my training set step by step while training with the Trainer. Is there a way to do so?

What I did so far: I have adjusted compute_metrics, but this function is only run on my evaluation set; I need the same for my training set. I also tried out the TrainerCallback, but I can’t access the model’s current predictions from the predefined callbacks.

Another idea would be to customize the Trainer with a custom train function, but first I want to ask whether there is an easier way to do so.

Thank you in advance!

There is no way to do this directly in the Trainer; it’s just not built that way (because evaluation is often pretty slow). You should tweak the code in your own subclass of Trainer to add a self.evaluate(self.train_dataset) at the appropriate line and then handle the logging.
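
One way to read that suggestion, as a rough sketch (the evaluate() signature below matches recent versions of transformers and may need adjusting for yours):

from transformers import Trainer

class TrainerWithTrainEval(Trainer):
    # Rough sketch: whenever the Trainer evaluates on the validation set,
    # also run evaluate() over the training set and log those metrics with
    # a "train" prefix so the two sets of numbers stay apart.
    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        metrics = super().evaluate(eval_dataset=eval_dataset,
                                   ignore_keys=ignore_keys,
                                   metric_key_prefix=metric_key_prefix)
        if metric_key_prefix == "eval":
            # only piggyback the extra train-set pass on the regular validation run
            metrics.update(super().evaluate(eval_dataset=self.train_dataset,
                                            ignore_keys=ignore_keys,
                                            metric_key_prefix="train"))
        return metrics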

1 Like

Thank you very much for your fast reply!
Just to be sure my problem was described correctly in the post above: my intention is to print, plot and monitor the loss and accuracy of my training set while training with the Trainer. Currently only the training loss is printed while the Trainer runs. If I were fine-tuning with native PyTorch, I could add an accuracy function to the training loop that computes the accuracy (or other metrics) on my training set per epoch alongside the loss. So my question is whether the same is easily possible with the Trainer.
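
For reference, this is roughly the bookkeeping I mean in a native loop (model, optimizer, device and train_dataloader are assumed to already exist, and each batch contains a "labels" key):

model.train()
correct, total, running_loss = 0, 0, 0.0
for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)          # the loss is returned because "labels" is in the batch
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # reuse the logits the model already produced for this batch
    preds = outputs.logits.argmax(dim=-1)
    correct += (preds == batch["labels"]).sum().item()
    total += batch["labels"].size(0)
    running_loss += outputs.loss.item()

print(f"train loss: {running_loss / len(train_dataloader):.4f}, "
      f"train accuracy: {correct / total:.4f}")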

To sum up, I don’t want to run a separate evaluation, because the model is predicting the labels to calculate the loss anyway. I just want to further use these predictions to calculate metrics on my training set.

Sorry for probably being unclear. Have a great weekend!

1 Like

I know this is an old issue, but I came across this while trying to determine the best way to track metrics besides the loss during training. I thought I’d post what I came up with in case it helps someone else.

To reiterate the context, like @Bumblebert, I’m interested in running additional metrics on the outputs that the model already computes during training, rather than running an additional evaluation run over the entire training set (using, e.g., self.evaluate(self.train_dataset)). My use case is that I’m training a multiple choice model and I’d like to see how the accuracy changes while training.

I’ve found the suggestion in the Trainer class to “Subclass and override for custom behavior.” to be a good idea a couple of times now :slight_smile: To compute custom metrics, I found a place where the outputs are easily accessible, compute_loss(), and added some code there. My comments below are prefixed with MAX: so they are easy to spot:

import torch
from transformers import Trainer


class CustomTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        MAX: Subclassed to compute training accuracy.

        How the loss is computed by Trainer. By default, all models return the loss in
        the first element.

        Subclass and override for custom behavior.
        """
        if self.label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        outputs = model(**inputs)

        # MAX: Start of new stuff.
        if "labels" in inputs:
            preds = outputs.logits.detach()
            acc = (
                (preds.argmax(axis=1) == inputs["labels"])
                .type(torch.float)
                .mean()
                .item()
            )
            self.log({"accuracy": acc})
        # MAX: End of new stuff.

        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            loss = self.label_smoother(outputs, labels)
        else:
            # We don't use .loss here since the model may return tuples instead of
            # ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss

Then, I instantiate a CustomTrainer instead of a Trainer and run as normal.
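
Roughly like this (the argument names are just placeholders for your own objects):

trainer = CustomTrainer(
    model=model,                      # e.g. a multiple choice / classification model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # still only used for the evaluation set, as usual
)
trainer.train()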

(Note that the above code isn’t battle-tested, and I only tried on a single GPU. So take it as a starting point and with a grain of salt.)

I started using the wandb plotting integration, which receives the results of the self.log() call we added and automatically makes a plot of the accuracy over training (my runs were called train_metrics and train_metrics_2).
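
(If it helps: I believe the wandb integration is switched on through the report_to argument of TrainingArguments, roughly as below; you also need wandb installed and logged in.)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    logging_steps=10,    # how often the values passed to self.log() are flushed
    report_to="wandb",   # send Trainer logs, including our "accuracy", to Weights & Biases
)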

Cheers!

4 Likes

I did this by adding a custom callback which calls the evaluate() method on train_dataset at the end of every epoch.

from copy import deepcopy
from transformers import Trainer, TrainerCallback


class CustomCallback(TrainerCallback):
    
    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer
    
    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train")
            return control_copy

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    tokenizer=tokenizer
)
trainer.add_callback(CustomCallback(trainer)) 
train = trainer.train()

This gives the train metrics like the following:

{'train_loss': 0.7159061431884766, 'train_accuracy': 0.4, 'train_f1': 0.5714285714285715, 'train_runtime': 6.2973, 'train_samples_per_second': 2.382, 'train_steps_per_second': 0.159, 'epoch': 1.0}
{'eval_loss': 0.8529007434844971, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_runtime': 2.0739, 'eval_samples_per_second': 0.964, 'eval_steps_per_second': 0.482, 'epoch': 1.0}
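
(For reference, compute_metrics here is the usual metrics function passed to the Trainer; a minimal sketch that would produce accuracy and F1 values like the above could look like this:)

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred has .predictions (logits) and .label_ids
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
    }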

6 Likes

Hello,

How do you get the train accuracy score? I only get the train loss when using the Trainer.

I will expand on @sid8491’s answer. In my use case, I have to keep fine-tuning on multiple datasets from different languages, and this way we can keep track of metrics (loss, precision, recall, …) across the different language datasets.

If anyone has suggestions for cleaner code, please do suggest.

from sklearn.metrics import precision_recall_fscore_support, accuracy_score, log_loss
from torch.nn import CrossEntropyLoss
import numpy as np
import torch
from copy import deepcopy
from transformers import Trainer, TrainerCallback, TrainingArguments

lang = 'en'

class CustomCallback(TrainerCallback):
    
    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer
    
    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train@"+lang)
            return control_copy

def compute_metrics(pred):
    global num_labels  # number of classes (e.g. model.config.num_labels); assumed to be defined elsewhere
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    loss_fct = CrossEntropyLoss()
    logits = torch.tensor(pred.predictions)
    labels = torch.tensor(labels)
    loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    return {
        'accuracy@'+lang: acc,
        'f1@'+lang: f1,
        'precision@'+lang: precision,
        'recall@'+lang: recall,
        'loss@'+lang: loss.item(),  # .item() so the logged value is a plain float
    }

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-4,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir=MODEL_DIR+'_EN',
    overwrite_output_dir=True,
    remove_unused_columns=False,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=en_train_dataset,
    eval_dataset=en_valid_dataset,
    compute_metrics=compute_metrics
)

trainer.add_callback(CustomCallback(trainer)) 

train_result = trainer.train()

trainer.evaluate(metric_key_prefix='test_en',
                eval_dataset=en_test_dataset)

2 Likes

Thank you. This helps me a lot.