Add metrics to object detection example

Hello everybody,

I'm new to Hugging Face and wanted to try out object detection, so I ran the Transformers object detection example from the Hugging Face docs (this one here: Object detection) and wanted to add some metrics while training the model. Every time I get the following error:

TypeError: Can't pad the values of type <class 'transformers.image_processing_utils.BatchFeature'>, only of nested list/tuple/dicts of tensors.

I create an eval_dataset like this:

cppe5["test"] = cppe5["test"].with_transform(transform_aug_ann)

and add it to the trainer:

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    print("here")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=splitted_ds_encoded["train"],
    eval_dataset=splitted_ds_encoded['test'],
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
)

Maybe someone here can help me?

Thanks in advance. Greetings Christoph

cc @MariaK

Hi Christoph! Thanks for the question, and sorry about the delay. In the guide, we didn't use the evaluation in the Trainer, because accuracy usually isn't used for object detection. Here, we evaluated separately on standard COCO metrics. If you want to measure accuracy during training, you'll need to check how the labels are batched in collate_fn(batch) (e.g. use batch["labels"] = [dict(label) for label in labels] instead). You may also need to modify your compute_metrics function.
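A minimal sketch of what that adjusted collate_fn could look like (assuming the image_processor and the transformed dataset from the guide):

def collate_fn(batch):
    # Pad the images in the batch to a common size and build the pixel mask
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    # Keep the labels as a plain list of per-image dicts, as suggested above
    labels = [dict(item["labels"]) for item in batch]
    return {
        "pixel_values": encoding["pixel_values"],
        "pixel_mask": encoding["pixel_mask"],
        "labels": labels,
    }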
If this doesn’t work, @ybelkada may be able to help with the evaluation.

Hey! Sorry for reviving this, and I may be kicking a dead horse here.

Some context: I'm super new to both Hugging Face and AI/ML. I played around with the TF Object Detection API about a year ago (which was a horrible experience in comparison), and I'm trying to update my “framework” to use Hugging Face tools.

Personally, I like the idea of having some kind of eval during training without having to write files to disk. I saw that there's a metric for mean IoU (Mean IoU - a Hugging Face Space by evaluate-metric), which I think should be a good metric for this, but I can't for the life of me get it to work. Do you have any tips, assuming this is even a good thing to do?

Hi @macklenc, were you able to find a solution? I've been getting into object detection as well, and I recently tried TFOD. I was looking for a way to do continuous evaluation with the COCO metrics and I haven't found a way to implement it.

I did get something working, and I sure hope it's right. I decided to use torchmetrics' mAP implementation (from torchmetrics.detection import MeanAveragePrecision). It turns out that in the evaluation step, the model returns an extra dictionary element called loss_dict that wasn't being accounted for properly in the returned logits, so I disabled that when I created my model, e.g.:

model = AutoModelForObjectDetection.from_pretrained(
    model_dest,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
    num_queries=3,
    keys_to_ignore_at_inference=["loss_dict"],
)

And I structured my compute_metrics as:

def compute_metrics(eval_pred: EvalPrediction, map: MeanAveragePrecision, image_processor=None):
    # image_processor is bound via functools.partial below; it isn't used here,
    # but keeping it in the signature avoids an unexpected-keyword-argument error
    (scores, pred_boxes, last_hidden_state, encoder_last_hidden_state), labels = eval_pred
    # scores shape: (batch_size, number of detected anchors, num_classes + 1) last class is the no-object class
    # pred_boxes shape: (batch_size, number of detected anchors, 4)
    # https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/detr-resnet50/README.md
    predictions = []
    for score, box in zip(scores, pred_boxes):
        # Extract the bounding boxes, labels, and scores from the model's output
        pred_scores = torch.from_numpy(score[:, :-1])  # Exclude the no-object class
        pred_boxes = torch.from_numpy(box)
        pred_labels = torch.argmax(pred_scores, dim=-1)

        # Get the scores corresponding to the predicted labels
        pred_scores_for_labels = torch.gather(pred_scores, 1, pred_labels.unsqueeze(-1)).squeeze(-1)
        predictions.append(
            {
                "boxes": pred_boxes,
                "scores": pred_scores_for_labels,
                "labels": pred_labels,
            }
        )
    target = [
        {
            "boxes": torch.from_numpy(labels[i]["boxes"]),
            "labels": torch.from_numpy(labels[i]["class_labels"]),
        }
        for i in range(len(labels))
    ]
    map.update(preds=predictions, target=target)
    results = map.compute()
    # Convert tensors to scalars/lists, MLFlow doesn't really like tensors
    results = {k: v.tolist() if isinstance(v, torch.Tensor) else v for k, v in results.items()}
    return results

When setting up the trainer I used the following:

mAP = MeanAveragePrecision(box_format="cxcywh", class_metrics=True)
metrics = functools.partial(compute_metrics, map=mAP, image_processor=image_processor)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=lambda batch: collate_fn(batch, image_processor),
    train_dataset=train_ds,
    tokenizer=image_processor,
    eval_dataset=test_ds,
    compute_metrics=metrics,
    # preprocess_logits_for_metrics=lambda logits, labels: logits[1:], # This is another way to remove the extra dict
)

It's not the cleanest solution, but it seems to be working for me. If you come up with a cleaner solution (e.g. using the image processor to decode the tensors, since using a different backend like YOLOS breaks the compute_metrics function), I'd definitely appreciate it if you could share.
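For anyone who wants to try that route, here's a rough, untested sketch of what decoding via the image processor might look like (the _DetrLikeOutput wrapper is just a made-up shim; post_process_object_detection only needs .logits and .pred_boxes):

import torch

class _DetrLikeOutput:
    # Hypothetical shim: post_process_object_detection only reads .logits and .pred_boxes
    def __init__(self, logits, pred_boxes):
        self.logits = logits
        self.pred_boxes = pred_boxes

def decode_predictions(scores_np, boxes_np, image_processor):
    outputs = _DetrLikeOutput(torch.from_numpy(scores_np), torch.from_numpy(boxes_np))
    # Returns a list of dicts with "scores", "labels", "boxes" per image.
    # With target_sizes=None the boxes stay normalized but come back in xyxy format,
    # so MeanAveragePrecision would need box_format="xyxy" (and the targets would
    # need converting from cxcywh as well).
    return image_processor.post_process_object_detection(
        outputs, threshold=0.0, target_sizes=None
    )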


Hi @macklenc, could you please share your data loader / anything done to the eval_dataset, as well as your collate_fn() function? I'm still running into an issue with my batched test set.
Thanks

@harpergrieve Here's my collate_fn(); if I recall correctly, I didn't change any of the functionality there:

def collate_fn(batch, image_processor):
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    labels = [dict(item["labels"]) for item in batch]
    batch = {"pixel_values": encoding["pixel_values"], "pixel_mask": encoding["pixel_mask"], "labels": labels}
    # batch = {"pixel_values": encoding["pixel_values"], "labels": labels}  # For YOLOS backend
    return batch

My train/test datasets are created with:

    train_ds = dataset["train"].with_transform(
        lambda examples: transform_aug_ann(examples, image_processor, train_transform)
    )
    test_ds = dataset["test"].with_transform(
        lambda examples: transform_aug_ann(examples, image_processor, test_transform)
    )

where transform_aug_ann is

def transform_aug_ann(examples, image_processor, transform):
    image_ids = examples["image_id"]
    images, bboxes, area, categories = [], [], [], []
    for image, objects in zip(examples["image"], examples["objects"]):
        image = np.array(image.convert("RGB"))[:, :, ::-1]
        out = transform(image=image, bboxes=objects["bbox"], category=objects["category"])
        area.append(objects["area"])
        images.append(out["image"])
        bboxes.append(out["bboxes"])
        categories.append(out["category"])
    targets = [
        {"image_id": id_, "annotations": format_annotations(id_, cat_, ar_, box_)}
        for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
    ]
    return image_processor(images=images, annotations=targets, return_tensors="pt")

and

def create_transform(width, height):
    train_transform = albumentations.Compose(
        [
            albumentations.SmallestMaxSize(max_size=1000, p=1),
            albumentations.RandomScale(p=0.5, scale_limit=(-0.4, 0.8)),
        ],
        bbox_params=albumentations.BboxParams(
            format="coco", label_fields=["category"], min_area=400, min_visibility=0.7
        ),
    )

    test_transform = albumentations.Compose(
        [
            albumentations.Resize(width, height),
        ],
        bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
    )
    return train_transform, test_transform
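(format_annotations just builds COCO-style annotation dicts, essentially the formatted_anns helper from the guide; roughly:)

def format_annotations(image_id, categories, areas, bboxes):
    # Sketch based on the guide's formatted_anns helper: one COCO-style dict per object
    annotations = []
    for category, area, bbox in zip(categories, areas, bboxes):
        annotations.append(
            {
                "image_id": image_id,
                "category_id": category,
                "isCrowd": 0,
                "area": area,
                "bbox": list(bbox),
            }
        )
    return annotations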

Unfortunately I haven't run this against the CPPE5 (or an equivalent) dataset; I ended up creating a custom one for a personal project, but it mimics the structure of CPPE5, and my data loader is just load_dataset(str(module_directory / "my_dataset"), name=config). If you can share your project, I can see if I can find the source of the problem.


Thanks! Figured it out; now I'm getting a weird CUDA out-of-memory error while validation is happening…

Is there a mAP COCO eval metric for instance segmentation?

Hello @macklenc,

I am trying to implement your solution, but there seems to be a bug with the return of EvalPrediction.
The input to compute_metrics (EvalPrediction) has inconsistent lengths of labels and scores: the scores length equals the length of the evaluation dataset, while the labels length equals the eval batch size. I searched a bit and debugged Hugging Face's Trainer class, and I noticed that the labels are concatenated together using the nested_detach function. This concatenates the two lists differently than list1 + list2: it concatenates item 1 of list1 with item 1 of list2, and so on, which aggregates multiple images into the same item.
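A rough illustration of what I mean (the shapes are made up):

import torch

# Labels from two eval batches of size 2, each a list of per-image dicts
batch_a = [{"boxes": torch.zeros(3, 4)}, {"boxes": torch.zeros(5, 4)}]
batch_b = [{"boxes": torch.ones(2, 4)}, {"boxes": torch.ones(7, 4)}]

# Plain list concatenation would keep one entry per image:
#   batch_a + batch_b  -> 4 dicts, one per image
# The Trainer instead pairs the lists element-wise and concatenates the tensors,
# so the result still has length 2 (the eval batch size), e.g.
#   [{"boxes": shape (3 + 2, 4)}, {"boxes": shape (5 + 7, 4)}]
# which mixes boxes from different images into the same item.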

Do you have any idea how to resolve this? Someone talked about it in this issue (here) on Hugging Face but never mentioned a solution.

The reason why (I also had this bug) is that the trainer processes the label_ids attribute and thus makes it unintelligible. If you look into it, you'll see that you've got a batch of image ids together and boxes also together. Anyway, Hugging Face seems to care a bit less about CV :wink:

Here’s the ugly solution which works (I did it).

  1. Update transformers to the newest version
  2. Open the trainer.py file in the transformers library: "wherever_you_have_your_python/lib/python3.11/site-packages/transformers/trainer.py"
  3. Go all the way to line 3253 and on the next line add all_eval_labels = []
  4. Then go to the for loop 11 lines down and WITHIN THAT LOOP add all_eval_labels.extend(inputs["labels"])
  5. Go further to line 3371, and just before if self.compute_metrics is not None and all_preds is not None and all_labels is not None: add all_labels = all_eval_labels
    It should look like this (a consolidated sketch of the whole edit follows the snippet below):
...
all_labels = all_eval_labels
# Metrics!
if self.compute_metrics is not None and all_preds is not None and all_labels is not None:
...
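Put together, the edited region of evaluation_loop should look roughly like this (just a sketch; the surrounding names come from trainer.py and the exact line numbers vary between transformers versions):

# Sketch of the manual edit inside Trainer.evaluation_loop (transformers/trainer.py)
all_eval_labels = []                               # step 3
for step, inputs in enumerate(dataloader):
    all_eval_labels.extend(inputs["labels"])       # step 4: keep the raw per-image label dicts
    ...                                            # existing prediction_step / gather logic
all_labels = all_eval_labels                       # step 5: hand the untouched labels to compute_metrics
# Metrics!
if self.compute_metrics is not None and all_preds is not None and all_labels is not None:
    ...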

Voilà! The first part is done. Now the compute_metrics method.

def compute_metrics(eval_pred: EvalPrediction):
    """Compute detection metrics"""

    _, scores, pred_boxes, last_hidden_state, encoder_last_hidden_state = eval_pred.predictions

    # scores shape: (number of samples, number of detected anchors, num_classes + 1) last class is the no-object class
    # pred_boxes shape: (number of samples, number of detected anchors, 4)
    # https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/detr-resnet50/README.md
    predictions = []
    for score, box in zip(scores, pred_boxes):
        # Extract the bounding boxes, labels, and scores from the model's output
        pred_scores = torch.from_numpy(score[:, :-1])  # Exclude the no-object class
        pred_boxes = torch.from_numpy(box)
        pred_labels = torch.argmax(pred_scores, dim=-1)

        # Get the scores corresponding to the predicted labels
        pred_scores_for_labels = torch.gather(pred_scores, 1, pred_labels.unsqueeze(-1)).squeeze(-1)
        predictions.append(
            {
                "boxes": pred_boxes,
                "scores": pred_scores_for_labels,
                "labels": pred_labels,
            }
        )

    target = [
        {
            "boxes": eval_pred.label_ids[i]["boxes"].detach().cpu(),
            "labels": eval_pred.label_ids[i]["class_labels"].detach().cpu(),
        }
        for i in range(len(eval_pred.label_ids))
    ]
    map = MeanAveragePrecision(box_format="xywh")
    map.update(preds=predictions, target=target)
    results = map.compute()
    results = {k: v.tolist() if isinstance(v, torch.Tensor) else v for k, v in results.items()}
    return results

Then in my trainer I just add this method as the compute_metrics parameter:

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=lambda batch: collate_fn(batch, is_yolo),
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    tokenizer=image_processor,
)

It will work, but it's not a perfect solution, and you should remember to revert the change if you decide to tackle other tasks like NLP.
Cheers!