How to Log Accuracy with Metadata in a Sentence Regression Task?

I’m working on a sentence regression task, where each sample consists of a sentence paired with a numerical scalar target. Each sample also includes metadata (e.g., project name, task type), and I want to compute and log the accuracy with respect to this metadata during evaluation.
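
To make this concrete, a single sample looks roughly like this (field names and values are only illustrative):

# One raw sample, schematically (names and values are placeholders, not my exact schema)
sample = {
    "sentence": "Refactor the data loading pipeline.",
    "numerical_score": 3.7,                  # scalar regression target
    "metadata": "ProjectA / summarization",  # e.g. project name, task type
}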

What I’ve Tried

So far, I’ve successfully:

  1. Modified the dataset class to return a data_dict containing:
  • input_ids
  • labels (for loss calculation, just like in a language modeling task)
  • numerical_score (the scalar target for regression)
  • metadata (a string field like project or task)
  2. Updated the collate function to pass all of these elements along to the model’s forward function (a simplified sketch of the dataset and collator is shown below).
  3. Confirmed that the metadata is carried through to the model outputs as an additional metadata attribute.
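
For reference, here is a simplified sketch of what the dataset and collator do; the class, function, and field names (numerical_score, metadata, etc.) are placeholders for my actual code:

import torch
from torch.utils.data import Dataset


class SentenceRegressionDataset(Dataset):
    # Simplified sketch of my dataset, not the real implementation.
    def __init__(self, samples, tokenizer, max_length=128):
        self.samples = samples        # list of dicts: sentence, score, metadata
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        enc = self.tokenizer(item["sentence"], truncation=True, max_length=self.max_length)
        return {
            "input_ids": enc["input_ids"],
            "labels": enc["input_ids"],        # LM-style labels for the loss
            "numerical_score": item["score"],  # scalar regression target
            "metadata": item["metadata"],      # string, e.g. "ProjectA"
        }


def collate_fn(batch, tokenizer):
    # Pads the tokenized fields and carries numerical_score / metadata through
    # so they reach the model's forward() (and, ideally, the metrics).
    padded = tokenizer.pad([{"input_ids": b["input_ids"]} for b in batch], return_tensors="pt")
    labels = padded["input_ids"].clone()
    labels[padded["attention_mask"] == 0] = -100  # ignore padding in the loss
    return {
        "input_ids": padded["input_ids"],
        "attention_mask": padded["attention_mask"],
        "labels": labels,
        "numerical_score": torch.tensor([b["numerical_score"] for b in batch], dtype=torch.float),
        "metadata": [b["metadata"] for b in batch],  # list of strings, not a tensor
    }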

Problem: Metadata Gets Lost Before Reaching compute_metrics

The issue arises during the evaluation_loop. I need access to metadata inside the compute_metrics function, but the Trainer class doesn’t seem to provide a clean way to pass it along.

For example, this is the relevant code in the evaluation loop:

if args.include_inputs_for_metrics:
    metrics = self.compute_metrics(
        EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=all_inputs)
    )
else:
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))

I enabled include_inputs_for_metrics=True hoping the metadata would be passed as part of all_inputs, but it gets stripped by this line in the evaluation loop:

inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None

What I Want to Achieve

I need a way to log the accuracy or any custom metric with respect to the metadata. Ideally, I don’t want to override the entire evaluation loop or write my own Trainer class, as that feels cumbersome and difficult to maintain, especially with parallelism.
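
Concretely, if the metadata did reach compute_metrics, the kind of per-metadata metric I’d like to log looks roughly like this (a sketch only; it assumes predictions/label_ids carry the regression outputs and targets, aligned with one metadata string per example):

import numpy as np

def compute_metrics_with_metadata(eval_pred, metadata):
    # Sketch only: the Trainer never actually passes `metadata` here, which is
    # exactly the problem I'm trying to solve.
    preds = np.asarray(eval_pred.predictions).squeeze()
    targets = np.asarray(eval_pred.label_ids).squeeze()

    metrics = {"mse": float(np.mean((preds - targets) ** 2))}

    # One metric per metadata group (e.g. per project or task type).
    groups = np.asarray(metadata)
    for group in np.unique(groups):
        mask = groups == group
        metrics[f"mse_{group}"] = float(np.mean((preds[mask] - targets[mask]) ** 2))
    return metrics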

What I’m Looking For

Is there a clean way to pass metadata through the evaluation loop, without hacking the Trainer class or completely rewriting the evaluation logic? I’m likely not the first person facing this issue, and I suspect there’s an elegant solution I might be missing.

Thanks in advance for your help!
