Metrics for Training Set in Trainer

Hey guys,

I am currently using the Trainer in order to train my DistilBertForSequenceClassification.

My problem: I want to stepwise print/save the loss and accuracy of my training set by using the Trainer. Is there a way to do so?

What I did so far: I have adjusted compute_metrics. But this function is only carried out on my evaluation set. I need the same for my training set. I also tried out the TrainerCallback. But I can’t access the current predictions of the model by using the predefined callbacks.

Another idea would be to customize the Trainer using a custom train function. But firstly I want to ask you whether there is an easier way to do so.

Thank you in advance!

There is no way to do this directly in the Trainer; it’s just not built that way (because evaluation is often pretty slow). You should tweak the code in your own subclass of Trainer to add a self.evaluate(self.train_dataset) at the appropriate line and then handle the logging.
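For reference, here is a minimal sketch of that suggestion, assuming compute_metrics is set so evaluation actually produces the metrics you care about. The subclass name and the choice to hook into evaluate() are just one way to wire it up, not part of the Trainer API:

from transformers import Trainer

class TrainEvalTrainer(Trainer):
    # Hypothetical subclass: whenever the Trainer evaluates, also report metrics
    # on the training set under a "train_" prefix.

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        if metric_key_prefix == "eval" and self.train_dataset is not None:
            # Extra pass over the training set (this is the slow part mentioned above).
            super().evaluate(
                eval_dataset=self.train_dataset,
                ignore_keys=ignore_keys,
                metric_key_prefix="train",
            )
        # Regular evaluation on the eval set.
        return super().evaluate(
            eval_dataset=eval_dataset,
            ignore_keys=ignore_keys,
            metric_key_prefix=metric_key_prefix,
        )

Note that this still runs a full extra forward pass over the training set each time, which is exactly the slowness mentioned above.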

1 Like

Thank you very much for your fast reply!
Just to be sure that my problem was correctly described in my post above: my intention is to print, plot, and monitor the loss and accuracy of my training set while training with the Trainer. Currently, only the loss on my training dataset is printed during training with the Trainer. If I were fine-tuning with native PyTorch, I could add an accuracy function to the training loop that also calculates the accuracy (or other metrics) on my training set per epoch, besides the loss. So my question is whether it is easily possible to do the same with the Trainer.

To sum up, I don’t want to run a separate evaluation, because the model is predicting the labels in order to calculate the loss anyway. I just want to reuse these predictions to calculate metrics on my training set.

Sorry for probably being unclear. Have a great weekend!

1 Like

I know this is an old issue, but I came across this while trying to determine the best way to track metrics besides the loss during training. I thought I’d post what I came up with in case it helps someone else.

To reiterate the context, like @Bumblebert, I’m interested in running additional metrics on the outputs that the model already computes during training, rather than running an additional evaluation run over the entire training set (using, e.g., self.evaluate(self.train_dataset)). My use case is that I’m training a multiple choice model and I’d like to see how the accuracy changes while training.

I’ve found the suggestion in the Trainer class to “Subclass and override for custom behavior.” to be a good idea a couple of times now :) To compute custom metrics, I found where the outputs are easily accessible, in compute_loss(), and added some code. My additions are marked with MAX: in the comments below:

import torch
from transformers import Trainer


class CustomTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        MAX: Subclassed to compute training accuracy.

        How the loss is computed by Trainer. By default, all models return the loss in
        the first element.

        Subclass and override for custom behavior.
        """
        if self.label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        outputs = model(**inputs)

        # MAX: Start of new stuff.
        if "labels" in inputs:
            preds = outputs.logits.detach()
            acc = (
                (preds.argmax(axis=1) == inputs["labels"])
                .type(torch.float)
                .mean()
                .item()
            )
            self.log({"accuracy": acc})
        # MAX: End of new stuff.

        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            loss = self.label_smoother(outputs, labels)
        else:
            # We don't use .loss here since the model may return tuples instead of
            # ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss

Then, I instantiate a CustomTrainer instead of a Trainer and run as normal.

(Note that the above code isn’t battle-tested, and I only tried on a single GPU. So take it as a starting point and with a grain of salt.)
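For concreteness, using it is just a drop-in replacement for Trainer; a minimal usage sketch, where model, training_args, the datasets, and compute_metrics are placeholders for whatever you already pass in:

# Drop-in usage sketch; all names below are assumed to be defined as in an
# ordinary Trainer setup.
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,   # still only used for the evaluation set
)
trainer.train()  # the "accuracy" values logged in compute_loss() go to the active logging integrations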

I started using the wandb integration, which receives the results of the self.log() call we added and automatically makes a plot:

(my runs were called train_metrics and train_metrics_2)

Cheers!

4 Likes

I did this by adding a custom callback which calls the evaluate() method with the train_dataset at the end of every epoch.

from copy import deepcopy
from transformers import TrainerCallback


class CustomCallback(TrainerCallback):

    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer
    
    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            # Evaluate on the training set; metrics are logged with a "train_" prefix.
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train")
            # Return the control saved before the nested evaluate() call (which resets
            # control.should_evaluate), so the regular evaluation on the eval set still runs.
            return control_copy

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the function that computes metrics of interest
    tokenizer=tokenizer
)
trainer.add_callback(CustomCallback(trainer)) 
train = trainer.train()

This gives the train metrics like the following:

{'train_loss': 0.7159061431884766, 'train_accuracy': 0.4, 'train_f1': 0.5714285714285715, 'train_runtime': 6.2973, 'train_samples_per_second': 2.382, 'train_steps_per_second': 0.159, 'epoch': 1.0}
{'eval_loss': 0.8529007434844971, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_runtime': 2.0739, 'eval_samples_per_second': 0.964, 'eval_steps_per_second': 0.482, 'epoch': 1.0}
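One thing to watch with this approach: control.should_evaluate is only set when the Trainer's evaluation strategy fires, so for per-epoch train metrics the training arguments need an epoch-based evaluation (and ideally logging) strategy. A minimal sketch with placeholder values:

from transformers import TrainingArguments

# Placeholder values; the two strategy arguments are what matter for the callback above.
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",   # sets control.should_evaluate at the end of each epoch
    logging_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)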

8 Likes

Hello,

How do you get the train accuracy score? I only get the train loss when using the Trainer.

I will expand on @sid8491’s answer. In my use case, I have to keep fine-tuning on multiple datasets from different languages, and this way we can keep track of metrics (loss, precision, recall, …) across the different language datasets.

If anyone has suggestions for cleaner code, please do suggest.

from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from torch.nn import CrossEntropyLoss
import numpy as np
import torch
from copy import deepcopy
from transformers import Trainer, TrainerCallback, TrainingArguments

lang = 'en'

class CustomCallback(TrainerCallback):
    
    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer
    
    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train@"+lang)
            return control_copy

def compute_metrics(pred):
    global num_labels  # num_labels is assumed to be defined elsewhere (e.g. model.config.num_labels)
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    loss_fct = CrossEntropyLoss()
    logits = torch.tensor(pred.predictions)
    labels = torch.tensor(labels)
    loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    return {
        'accuracy@'+lang: acc,
        'f1@'+lang: f1,
        'precision@'+lang: precision,
        'recall@'+lang: recall,
        'loss@'+lang: loss.item(),
    }

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-4,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir=MODEL_DIR+'_EN',  # MODEL_DIR is assumed to be defined elsewhere
    overwrite_output_dir=True,
    remove_unused_columns=False,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=en_train_dataset,
    eval_dataset=en_valid_dataset,
    compute_metrics=compute_metrics
)

trainer.add_callback(CustomCallback(trainer)) 

train_result = trainer.train()

trainer.evaluate(metric_key_prefix='test_en',
                eval_dataset=en_test_dataset)
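To move on to the next language with the same setup, you would switch the lang tag and swap in that language's datasets before continuing to fine-tune; a hypothetical sketch (the de_* datasets are placeholders, and you would probably also point output_dir at a new directory):

# Hypothetical continuation for a second language; de_train_dataset,
# de_valid_dataset, and de_test_dataset are placeholders.
lang = 'de'

trainer = Trainer(
    model=model,                      # keep fine-tuning the same model in place
    args=training_args,
    train_dataset=de_train_dataset,
    eval_dataset=de_valid_dataset,
    compute_metrics=compute_metrics,  # picks up the new lang tag in the metric names
)
trainer.add_callback(CustomCallback(trainer))
trainer.train()

trainer.evaluate(metric_key_prefix='test_de', eval_dataset=de_test_dataset)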
2 Likes

Thank you. This helps me a lot.

That’s a good way to do it. But I guess the problem with that approach is that you are training and then evaluating on the training data, which means running through the data twice, instead of just collecting the results during training.

I answered this question, covering metrics per epoch and per batch as well as other metrics (F1, …), here: Batch and Epoch training metrics for transformers Trainer - Stack Overflow