Inaccurate bboxes after finetuning DETR

I followed the Object Detection guide to fine-tune a DETR model. However, the predicted bboxes for objects in the upper-left corner of an image tend to be more accurate than those in the bottom-right corner (the further from (0, 0), the more inaccurate they get). I'm not sure what could be wrong, whether it's the transformations, the visualization code, or something else. The original bounding boxes for the dataset seem correct. I'd appreciate any pointers, thanks!

Notebook link: Google Colab

Example training image:

Image that demonstrates the bbox issue:


This doesn't look like a bug in the model itself, or an ordinary logic bug…
It looks like the kind of bug that appears when a calculation simply isn't precise enough. In that case, isn't the round() suspicious?

  for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
      box = [round(i, 2) for i in box.tolist()]
      x, y, x2, y2 = tuple(box)
      draw.rectangle((x, y, x2, y2), outline="red", width=1)
      draw.text((x, y), model.config.id2label[label.item()], fill="white")

Thanks for the suggestion. I also thought it looked like a rounding error, but visually it still looks the same after increasing the precision. If the rounding were applied to normalized coordinates, it would be a more likely cause, but I can't see where that could be happening.

Detected chicken with confidence 0.999 at location [906.24420166, 463.305633545, 952.193481445, 512.982299805]
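To put numbers on that: here is a minimal sketch (the 1280x720 original size is just an assumed example, not taken from the notebook) showing that rounding absolute pixel coordinates to two decimals is invisible, whereas rounding normalized coordinates before scaling back up would not be:

box_px = [906.24420166, 463.305633545, 952.193481445, 512.982299805]
print([round(v, 2) for v in box_px])            # error <= 0.005 px, invisible

orig_w, orig_h = 1280, 720                      # assumed original image size
box_norm = [box_px[0] / orig_w, box_px[1] / orig_h, box_px[2] / orig_w, box_px[3] / orig_h]
rounded = [round(v, 2) for v in box_norm]       # e.g. 0.7080 -> 0.71
print([rounded[0] * orig_w, rounded[1] * orig_h, rounded[2] * orig_w, rounded[3] * orig_h])
# now off by a few pixels, but uniformly in both directions,
# not growing toward the bottom-right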

Hmm, if those numbers are wrong, then something must have been missed during training.
The images are resized, and there's a good chance that coordinate information is lost along the way.
If the numbers are right, then the visualization code is buggy.

In any case, it is not such a serious error; it is probably just a multiplication or rounding mistake somewhere, either in training or in visualization. Your program is clearly not doing everything wrong, since the boxes are not reversed or completely disjointed, it is only a trend.

Or has your model turned into an AI that only recognizes chicken butts, because chicken butts were all it saw during training?
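To make the "coordinates get lost during resizing" idea concrete, here is a toy calculation (the 480 and 360 sizes are assumptions, not values from the notebook) showing how normalizing boxes against one canvas and rescaling them with a different reference size produces exactly this pattern, fine near (0, 0) and increasingly off toward the bottom-right:

canvas = 480        # padded square canvas the boxes were normalized against (assumed)
resized_h = 360     # unpadded image height used for rescaling by mistake (assumed)

for y_true in (10, 100, 300):
    y_norm = y_true / canvas            # normalized against the padded canvas
    y_back = y_norm * resized_h         # rescaled with the wrong reference size
    print(y_true, y_back, y_true - y_back)
# 10 -> 7.5 (off by 2.5), 100 -> 75.0 (off by 25.0), 300 -> 225.0 (off by 75.0)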

Ok, switching from Albumentations to DetrImageProcessor to resize the images fixes it. I am still confused by the following:

  1. If I follow the Object Detection guide and set max_height and max_width to MAX_SIZE, I still get inaccurate coordinates. It only works when I set height and width to MAX_SIZE, disregarding the aspect ratio. Does this suggest that there’s a bug in the padding or collate function?
  2. Since I'm setting height and width, it seems identical to what I was doing with Albumentations. Why do they yield different results? (The two size configurations are sketched right below.)
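For reference, here is a sketch of the two configurations from point 1, using the values that appear further down the thread; treat it as an illustration of the settings, not as a fix:

from transformers import AutoImageProcessor

MAX_SIZE = 480
checkpoint = "facebook/detr-resnet-50"

# Guide's setting: aspect-preserving resize, height and width capped at MAX_SIZE.
# Images in a batch then have different shapes and depend on padding (+ pixel_mask).
size_guide = {"max_height": MAX_SIZE, "max_width": MAX_SIZE}

# Workaround that produced accurate boxes: stretch every image to exactly
# MAX_SIZE x MAX_SIZE, ignoring aspect ratio, so the pad step becomes a no-op.
size_fixed = {"height": MAX_SIZE, "width": MAX_SIZE}

image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    do_resize=True,
    size=size_fixed,    # swap in size_guide to reproduce the inaccurate boxes
    do_pad=True,
    pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
)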

cc @nielsr @qubvel-hf do you have any insights?


Hi, it does indeed look like a bug, possibly related to the boxes being rescaled to the wrong image coordinates. You can take inputs.pixel_values, use its shape as the target size, and try to visualize the boxes on it.
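In code, that suggestion looks roughly like the following sketch (assuming the usual image, model, and image_processor variables from the notebook):

import torch

with torch.no_grad():
    inputs = image_processor(images=image, return_tensors="pt")
    outputs = model(**inputs)

# target size = (height, width) of the tensor the model actually saw,
# i.e. the resized + padded image, not the original photo
target_sizes = torch.tensor([inputs["pixel_values"].shape[2:]])
results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
# If these boxes line up on the processed image but not on the original,
# the bug is in mapping coordinates back to the original image size.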


Not sure if it's related, but in the guide's augment_and_transform_batch function there's this snippet. Why do we want to remove the mask? I tried removing this code, thinking that it could have caused the model to think the chickens are only in the top-left section of the image (since the bottom/right sections are always padded), but it actually made things worse: now the coordinates are even more squished.

    if not return_pixel_mask:
        result.pop("pixel_mask", None)

Squished boxes:

I updated the notebook with more visualization, showing what a collated batch with a mask looks like. It looks correct, and I can't find an issue with the inference visualization code either. Any pointers?
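For context, the collate_fn used below is roughly the one from the guide; note that pixel_mask only makes it into the batch if the transform step kept it (i.e. return_pixel_mask=True upstream):

import torch

def collate_fn(batch):
    data = {}
    data["pixel_values"] = torch.stack([x["pixel_values"] for x in batch])
    data["labels"] = [x["labels"] for x in batch]
    if "pixel_mask" in batch[0]:
        data["pixel_mask"] = torch.stack([x["pixel_mask"] for x in batch])
    return data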

Those two lines are just the body of the return_pixel_mask option, so don't delete them.
I think the resizing happens when the chicken image is passed to the preprocessor.
Here's an example

There is a way to do it manually with torch, but if you can do it with transformers, I recommend letting transformers handle it, because it is easier, especially for batch processing.
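Concretely, "letting transformers handle it" means passing the COCO annotations together with the image, so the boxes are resized and normalized in the same call as the pixels. A hedged sketch, where image, image_processor, and coco_annotations stand in for the notebook's own variables:

# coco_annotations: the usual list of COCO-style dicts with "bbox" in [x, y, w, h]
annotations = {"image_id": 0, "annotations": coco_annotations}
encoding = image_processor(images=image, annotations=annotations, return_tensors="pt")

# encoding["labels"][0]["boxes"] now holds normalized (cx, cy, w, h) boxes that
# match encoding["pixel_values"], so pixels and boxes cannot drift apart
print(encoding["pixel_values"].shape, encoding["labels"][0]["boxes"])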

Why might we want to disable the mask? DETR's documentation:

Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
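At the model level the mask is just an extra input to the forward pass. A minimal sketch, assuming model is the DetrForObjectDetection being fine-tuned and batch comes from a collate_fn that kept "pixel_mask":

outputs = model(
    pixel_values=batch["pixel_values"],
    pixel_mask=batch.get("pixel_mask"),   # None if the mask was dropped upstream
    labels=batch["labels"],
)
print(outputs.loss)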

Anyhow, the preprocessed images look correct; at least the sizes and the bounding boxes do.

Preprocessed batch without mask:

Preprocessed batch with mask:

Image processor:

from transformers import AutoImageProcessor
import albumentations
import numpy as np
import torch

IMAGE_SIZE = 480
MAX_SIZE = IMAGE_SIZE
checkpoint = "facebook/detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    do_resize=True,
    # aspect-preserving resize: height and width are capped at MAX_SIZE
    size={"max_height": MAX_SIZE, "max_width": MAX_SIZE},
    do_pad=True,
    # pad every image up to a fixed MAX_SIZE x MAX_SIZE canvas
    pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
)

# no-op augmentation; only passes the COCO-format boxes through
train_transform = albumentations.Compose(
    [
        albumentations.NoOp(),
    ],
    bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
)

Visualization of a preprocessed batch:

import numpy as np
import cv2
from PIL import Image, ImageDraw

def pixel_values_to_img(pixel_values):
  # undo normalization and convert the CHW tensor back to an HWC uint8 image
  npimg = pixel_values.numpy()
  npimg = np.transpose(npimg, (1,2,0))
  npimg = (npimg * image_processor.image_std + image_processor.image_mean) * 255
  npimg = npimg.astype(np.uint8)
  return npimg

def pixel_mask_to_img(pixel_mask):
  # turn a (1, H, W) pixel mask into a 3-channel black/white uint8 image
  npimg = pixel_mask.numpy()
  npimg = npimg[0]
  rgb = np.zeros((npimg.shape[0], npimg.shape[1], 3), dtype=np.uint8)
  rgb[:,:,0] = npimg * 255
  rgb[:,:,1] = npimg * 255
  rgb[:,:,2] = npimg * 255
  return rgb

# translate the preprocessed image batch into a PIL image
transformed_img = collate_fn([train_dataset_transformed[0]])
pixel_values = transformed_img['pixel_values']
if len(pixel_values.shape) > 3:
  pixel_values = pixel_values[0]
img=pixel_values_to_img(pixel_values)
# note: numpy image arrays are (height, width, channels)
img_h, img_w = img.shape[0], img.shape[1]

# if there's a mask, add a magenta tint to the valid, non-padded pixels
if 'pixel_mask' in transformed_img:
  mask=pixel_mask_to_img(transformed_img['pixel_mask'])
  masked_image = img.copy()
  masked_image = np.where(mask.astype(int),
                          np.array([255,0,255], dtype='uint8'),
                          masked_image)

  masked_image = masked_image.astype(np.uint8)
  masked_image = cv2.addWeighted(img, 0.6, masked_image, 0.4, 0)
  img = masked_image

img = Image.fromarray(img)

# draw bboxes (stored as normalized center_x, center_y, width, height)
draw = ImageDraw.Draw(img)
for box in transformed_img['labels'][0]['boxes']:
  cx, cy, w, h = img_w * box[0], img_h * box[1], img_w * box[2], img_h * box[3]
  x, y = cx - w / 2, cy - h / 2
  x2, y2 = x + w, y + h
  draw.rectangle((x, y, x2, y2), outline="white", width=1)

img

If you're stuck on a bug you can't pin down, try it with different data that you know should work; a square image would be even better.
Once you know there is a correct answer, you can track down the bug by process of elimination. If there is no correct answer at all, it would be a bug in the library, but that is not likely this time.
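As a concrete version of that, here is a sketch that fabricates one square image with a single known box and pushes it through the image_processor configured above, so the correct normalized output is known in advance (the sizes and the single fake object are made up for the test):

import numpy as np

H = W = 640
image = np.zeros((H, W, 3), dtype=np.uint8)
image[400:500, 480:600] = 255                 # bright block = the fake "object"

annotations = {
    "image_id": 0,
    "annotations": [{
        "image_id": 0, "category_id": 0, "iscrowd": 0,
        "area": 120 * 100, "bbox": [480, 400, 120, 100],   # COCO [x, y, w, h]
    }],
}
enc = image_processor(images=image, annotations=annotations, return_tensors="pt")

# expected normalized (cx, cy, w, h): (540/640, 450/640, 120/640, 100/640)
print(enc["labels"][0]["boxes"])              # should be ~[0.8438, 0.7031, 0.1875, 0.1562]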

Anyway, if you want to disable the mask, change it here rather than inside the function. Whether you actually do so is beside the point.

from functools import partial

transform_batch = partial(
    # augment_and_transform_batch, transform=train_transform, image_processor=image_processor, return_pixel_mask=True
    augment_and_transform_batch, transform=train_transform, image_processor=image_processor, return_pixel_mask=False
)

Does this suggest that there’s a bug in the padding or collate function?

I think the behavior of the resizing process is just stubborn, rather than an outright bug.

There are various functions for various models in transformers, but most of them are shared through class inheritance, so in essence they are reused across models.
If a problem occurs in one model, it can occur in any model; and if it is a bug, fixing it once will fix it everywhere.