Add metrics to object detection example

Hello everybody,

I’m new to Hugging Face and wanted to try out object detection. I ran the Transformers object detection example from the Hugging Face docs (this one here: Object detection) and wanted to add some metrics while training the model. Every time, I get the following error:

TypeError: Can't pad the values of type <class 'transformers.image_processing_utils.BatchFeature'>, only of nested list/tuple/dicts of tensors.

I create an eval_dataset like this:

cppe5["test"] = cppe5["test"].with_transform(transform_aug_ann)

and add it to the trainer:

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(

Maybe someone here can help me?

Thanks in advance. Greetings Christoph

cc @MariaK

Hi Christoph! Thanks for the question, and sorry about the delay. In the guide, we didn’t use evaluation in the Trainer, because accuracy usually isn’t used for object detection. Instead, we evaluated separately on standard COCO metrics. If you want to measure accuracy during training, you’ll need to check how the labels are batched in collate_fn(batch) (e.g. use batch["labels"] = [dict(label) for label in labels] instead). You may also need to modify your compute_metrics function.
If this doesn’t work, @ybelkada may be able to help with the evaluation.
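
A minimal sketch of the dict(label) suggestion, assuming (as the error message suggests) that the padding code only recognizes built-in list/tuple/dict containers. FakeBatchFeature is a hypothetical stand-in so the snippet runs without transformers; the real BatchFeature is likewise a UserDict subclass, and labels_to_plain_dicts is an illustrative name.

```python
from collections import UserDict


class FakeBatchFeature(UserDict):
    """Hypothetical stand-in for transformers' BatchFeature (a UserDict subclass)."""


def labels_to_plain_dicts(batch):
    # Convert every dict-like label object into a built-in dict.
    return [dict(item["labels"]) for item in batch]


batch = [
    {"labels": FakeBatchFeature({"class_labels": [1, 4], "boxes": [[0.5, 0.5, 0.2, 0.3]]})},
    {"labels": FakeBatchFeature({"class_labels": [2], "boxes": [[0.1, 0.2, 0.1, 0.1]]})},
]

# A UserDict is dict-like but not an instance of the built-in dict.
print(isinstance(batch[0]["labels"], dict))  # False

plain = labels_to_plain_dicts(batch)
print(isinstance(plain[0], dict))  # True
```

In a real collate_fn, the same list comprehension would build the "labels" entry of the returned batch.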

Hey! Sorry for reviving this, and I may be kicking a dead horse here.

Some context: I’m super new to both Hugging Face and AI/ML. I played around with the TF Object Detection API about a year ago (which was a horrible experience in comparison), and I’m trying to update my “framework” to use Hugging Face tools.

Personally, I like the idea of having some kind of eval during training without having to write files to disk. I saw that there’s a mean IoU metric (Mean IoU - a Hugging Face Space by evaluate-metric), which I think should be a good metric for this, but I can’t for the life of me get it to work. Do you have any tips, assuming this is even a good thing to do?
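
As a point of reference for the IoU idea mentioned here, this is a minimal sketch of intersection over union for two axis-aligned boxes. It is plain Python for illustration only: the function name box_iou and the (x_min, y_min, x_max, y_max) layout are assumptions, and it is not the evaluate metric itself.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(iou)  # overlap area 1, union area 7
```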

Hi @macklenc, were you able to find a solution? I’ve been getting into object detection as well and I recently tried TFOD. I was looking for a way to do continuous evaluation with the COCO metrics, and I haven’t found a way to implement that.

I did get something working, and I sure hope it’s right. I decided to use torchmetrics’ mAP implementation (from torchmetrics.detection import MeanAveragePrecision). It turns out that in the evaluation step, the model returns an extra dictionary element called loss_dict that wasn’t being accounted for properly in the returned logits, so I disabled that when I created my model, e.g.:

model = AutoModelForObjectDetection.from_pretrained(

And I structured my compute_metrics as:

import torch
from torchmetrics.detection import MeanAveragePrecision
from transformers import EvalPrediction


def compute_metrics(eval_pred: EvalPrediction, map: MeanAveragePrecision):
    (scores, pred_boxes, last_hidden_state, encoder_last_hidden_state), labels = eval_pred
    # scores shape: (batch_size, number of detected anchors, num_classes + 1) last class is the no-object class
    # pred_boxes shape: (batch_size, number of detected anchors, 4)
    predictions = []
    for score, box in zip(scores, pred_boxes):
        # Extract the bounding boxes, labels, and scores from the model's output
        pred_scores = torch.from_numpy(score[:, :-1])  # Exclude the no-object class
        image_boxes = torch.from_numpy(box)  # Separate name to avoid shadowing pred_boxes
        pred_labels = torch.argmax(pred_scores, dim=-1)

        # Get the scores corresponding to the predicted labels (raw logits, not probabilities)
        pred_scores_for_labels = torch.gather(pred_scores, 1, pred_labels.unsqueeze(-1)).squeeze(-1)
        predictions.append(
            {
                "boxes": image_boxes,
                "scores": pred_scores_for_labels,
                "labels": pred_labels,
            }
        )
    target = [
        {
            "boxes": torch.from_numpy(labels[i]["boxes"]),
            "labels": torch.from_numpy(labels[i]["class_labels"]),
        }
        for i in range(len(labels))
    ]
    map.update(preds=predictions, target=target)
    results = map.compute()
    map.reset()  # Clear accumulated state so the next evaluation starts fresh
    # Convert tensors to scalars/lists, MLFlow doesn't really like tensors
    results = {k: v.tolist() if isinstance(v, torch.Tensor) else v for k, v in results.items()}
    return results
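
The function above passes raw logits to the metric as scores. One possible variant, sketched here in plain Python under the assumption that probabilities are preferred, is to apply a softmax over all classes (including the no-object class) and only then drop the last column. score_queries is a hypothetical helper, not part of transformers or torchmetrics.

```python
import math


def score_queries(logits_per_query):
    """Return (label, probability) per query, ignoring the trailing no-object class."""
    results = []
    for logits in logits_per_query:
        shift = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - shift) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]

        object_probs = probs[:-1]  # drop the no-object class after the softmax
        label = max(range(len(object_probs)), key=object_probs.__getitem__)
        results.append((label, object_probs[label]))
    return results


# Two queries, two real classes plus the no-object class in the last position.
detections = score_queries([[2.0, 0.5, 1.0], [0.1, 3.0, 0.2]])
for label, prob in detections:
    print(label, round(prob, 3))
```

Because the softmax includes the no-object class, a query that mostly predicts "no object" ends up with a low score for its best real class, which a score threshold can then filter out.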

When setting up the trainer I used the following:

mAP = MeanAveragePrecision(box_format="cxcywh", class_metrics=True)
metrics = functools.partial(compute_metrics, map=mAP)  # only bind arguments that compute_metrics accepts
trainer = Trainer(
    data_collator=lambda batch: collate_fn(batch, image_processor),
    # preprocess_logits_for_metrics=lambda logits, labels: logits[1:], # This is another way to remove the extra dict

It’s not the cleanest solution, but seems to be working for me. If you come up with a cleaner solution (e.g. using the image processor to decode the tensors since using a different backend like YOLOS breaks the compute_metrics function) I’d definitely appreciate it if you could share.
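
The functools.partial pattern used in the trainer setup can be sketched in isolation. Trainer calls compute_metrics with a single EvalPrediction argument, so any extra argument has to be bound beforehand, and every bound keyword must match a parameter of the function. The toy compute_metrics below and its metric_name parameter are made up for illustration.

```python
import functools


def compute_metrics(eval_pred, metric_name):
    # Stand-in body: a real implementation would update and compute a metric object.
    predictions, labels = eval_pred
    correct = sum(p == l for p, l in zip(predictions, labels))
    return {metric_name: correct / len(labels)}


# Bind the extra argument up front; the resulting callable takes only eval_pred.
bound = functools.partial(compute_metrics, metric_name="toy_accuracy")
result = bound(([1, 0, 1, 1], [1, 1, 1, 0]))
print(result)  # {'toy_accuracy': 0.5}

# Binding a keyword the function does not accept fails only when it is called.
unexpected = None
try:
    functools.partial(compute_metrics, metric_name="x", image_processor=None)(([1], [1]))
except TypeError as err:
    unexpected = str(err)
print(unexpected)
```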

Hi @macklenc, could you please share your data loader and anything done to the eval_dataset, as well as your collate_fn() function? I’m still running into an issue with my batched test set.

@harpergrieve Here’s my collate_fn(). If I recall correctly, I didn’t change any of the functionality there:

def collate_fn(batch, image_processor):
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    labels = [dict(item["labels"]) for item in batch]
    batch = {"pixel_values": encoding["pixel_values"], "pixel_mask": encoding["pixel_mask"], "labels": labels}
    # batch = {"pixel_values": encoding["pixel_values"], "labels": labels}  # For YOLOS backend
    return batch

My train/test datasets are created with:

    train_ds = dataset["train"].with_transform(
        lambda examples: transform_aug_ann(examples, image_processor, train_transform)
    )
    test_ds = dataset["test"].with_transform(
        lambda examples: transform_aug_ann(examples, image_processor, test_transform)
    )

where transform_aug_ann is

def transform_aug_ann(examples, image_processor, transform):
    image_ids = examples["image_id"]
    images, bboxes, area, categories = [], [], [], []
    for image, objects in zip(examples["image"], examples["objects"]):
        image = np.array(image.convert("RGB"))[:, :, ::-1]
        out = transform(image=image, bboxes=objects["bbox"], category=objects["category"])
        # Collect the augmented image and its annotations
        area.append(objects["area"])
        images.append(out["image"])
        bboxes.append(out["bboxes"])
        categories.append(out["category"])
    targets = [
        {"image_id": id_, "annotations": format_annotations(id_, cat_, ar_, box_)}
        for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
    ]
    return image_processor(images=images, annotations=targets, return_tensors="pt")


def create_transform(width, height):
    train_transform = albumentations.Compose(
        [
            albumentations.SmallestMaxSize(max_size=1000, p=1),
            albumentations.RandomScale(p=0.5, scale_limit=(-0.4, 0.8)),
        ],
        bbox_params=albumentations.BboxParams(
            format="coco", label_fields=["category"], min_area=400, min_visibility=0.7
        ),
    )

    test_transform = albumentations.Compose(
        [
            # Resize takes (height, width), so pass them by keyword
            albumentations.Resize(height=height, width=width),
        ],
        bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
    )
    return train_transform, test_transform

Unfortunately, I haven’t run this against the CPPE-5 (or an equivalent) dataset. I ended up creating a custom one for a personal project, but it mimics the structure of CPPE-5, and my data loader is just load_dataset(str(module_directory / "my_dataset"), name=config). If you can share your project, I can see if I can find the source of the problem.

Thanks! I figured it out. Now I’m getting a weird CUDA out-of-memory error while validation is running…