Trainer never invokes compute_metrics

def compute_metrics(p: EvalPrediction):
        print("***Computing Metrics***") # THIS LINE NEVER PRINTED
        preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
        preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
        if data_args.task_name is not None:
            result = metric.compute(predictions=preds, references=p.label_ids)
            if len(result) > 1:
                result["combined_score"] = np.mean(list(result.values())).item()
            return result
        elif is_regression:
            return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
        else:
            return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}

...

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        metrics = train_result.metrics
        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.save_model()  # Saves the tokenizer too for easy upload
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        tasks = [data_args.task_name]
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            tasks.append("mnli-mm")
            eval_datasets.append(raw_datasets["validation_mismatched"])

        for eval_dataset, task in zip(eval_datasets, tasks):
            metrics = trainer.evaluate(eval_dataset=eval_dataset)

            max_eval_samples = (
                data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
            )
            metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

            trainer.log_metrics("eval", metrics)
            trainer.save_metrics("eval", metrics)

    "output_dir": "./output_dir",
    "do_train": true,
    "do_eval": true,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "logging_strategy": "epoch",
    "save_strategy": "epoch",
    "evaluation_strategy": "epoch",
    "prediction_loss_only": false,

I have a question about training on my own dataset, using code forked from run_glue.py. The JSON fragment above shows my TrainingArguments.
During training and validation, compute_metrics never seems to be invoked, while everything else runs correctly.

How can I fix this so that I get accuracy or other metrics?
Please let me know if you need more information or code :slight_smile:

Are you sure your datasets have proper labels? Missing labels may be the reason compute_metrics is skipped.

Hi, I stepped through the code with a debugger,

and I checked whether the labels are present before passing my eval_dataset (in the evaluation case) to trainer.evaluate().

The batched eval_dataset has shape (batch_size, 6) and consists of
['attention_mask', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'], and the labels are there, as you asked.

Is there any way to step into the inner evaluation_loop method so I can check how it works?

You can see the batches that will be passed to your model for evaluation with:

for batch in trainer.get_eval_dataloader(eval_dataset):
    break

And see if it does contain the "labels" key.
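The check above can be wrapped in a small helper. This is only a sketch: `first_batch_keys` and `has_labels` are hypothetical names, and the code assumes each batch is a dict-like mapping. In real use you would pass `trainer.get_eval_dataloader(eval_dataset)`; the list of dicts below is a stand-in so the snippet runs without a model.

```python
def first_batch_keys(dataloader):
    """Return the sorted keys of the first batch, or [] if the loader is empty."""
    for batch in dataloader:
        return sorted(batch.keys())
    return []


def has_labels(dataloader, label_key="labels"):
    """True if the first batch carries the label key."""
    return label_key in first_batch_keys(dataloader)


# Stand-in for trainer.get_eval_dataloader(eval_dataset): a list of dict batches.
fake_loader = [
    {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]], "labels": [0]},
]
print(first_batch_keys(fake_loader))
print(has_labels(fake_loader))
```

If `has_labels` comes back false, the labels are being lost somewhere between the dataset and the batch.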


I do get the 'labels' key in the batch, but Trainer still doesn't return metrics.

I'll just fall back to computing the metrics manually for now…

Thank you for your answer! :grinning_face_with_smiling_eyes:
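For anyone taking the same manual route, here is a minimal sketch of an accuracy computation in plain Python. It assumes the logits arrive as a list of per-class score lists and the labels as a list of integers; `manual_accuracy` and `argmax` are hypothetical helper names, and the numbers are made up for illustration. In practice the inputs would likely come from `trainer.predict(eval_dataset)`, whose output exposes `predictions` and `label_ids`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def manual_accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the label."""
    if len(logits) != len(label_ids):
        raise ValueError("logits and label_ids must have the same length")
    if not logits:
        return 0.0
    correct = sum(1 for row, label in zip(logits, label_ids) if argmax(row) == label)
    return correct / len(label_ids)


# Made-up logits for three examples and two classes (illustrative values only).
example_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
example_labels = [1, 0, 0]
print(manual_accuracy(example_logits, example_labels))
```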

Hi,
I have the same problem, and it still does not work:

  • I define my own compute_metrics() function
  • I create the Trainer as written above

for batch in trainer.get_eval_dataloader(eval_dataset):
    print(batch)
    break

gives me “labels”, but the compute_metrics function is never called. What else has to be configured?
Thanks!


@jheinecke

Avoid modifying TrainingArguments keys manually, especially the evaluation strategy, logging strategy, or save strategy. The __post_init__ of TrainingArguments makes sure these fields hold instances of IntervalStrategy rather than plain strings, so if you override one with e.g. training_args.evaluation_strategy = "steps", you will run into trouble. If you really need to override, use training_args.evaluation_strategy = IntervalStrategy.STEPS

See transformers/trainer_callback.py and transformers/training_args.py at commit 8afaaa26f5754948f4ddf8f31d70d0293488a897 in huggingface/transformers on GitHub.
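To illustrate why a raw string can silently fail such a check, here is a sketch using a stand-in enum. `Strategy` and `should_evaluate_at_epoch_end` are hypothetical and are not the library's code. The sketch assumes a plain `Enum`; an enum that also subclasses `str` would compare equal to the string, so whether the library behaves this way depends on the version.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical stand-in for an interval-strategy enum (not the library's class)."""
    NO = "no"
    STEPS = "steps"
    EPOCH = "epoch"


def should_evaluate_at_epoch_end(strategy):
    """Mimics a callback-style check that compares against an enum member."""
    return strategy == Strategy.EPOCH


# A raw string never equals a plain Enum member, so the check quietly fails.
print(should_evaluate_at_epoch_end("epoch"))
# Converting the string to a member first makes the comparison succeed.
print(should_evaluate_at_epoch_end(Strategy("epoch")))
```

With a plain enum, the first call returns False and nothing raises an error, which would match the symptom of evaluation hooks never firing.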