Inaccurate bboxes after finetuning DETR

I followed the Object Detection guide to fine-tune a DETR model. However, the predicted bboxes for objects in the upper-left corner of an image tend to be more accurate than those in the bottom-right corner (the further from (0, 0), the more inaccurate they get). I'm not sure what could be wrong, whether it's the transformations, the visualization code, or something else. The original bounding boxes for the dataset seem correct. I'd appreciate any pointers, thanks!

Notebook link: Google Colab

Example training image:

Image that demonstrates the bbox issue:


This doesn't look like a bug in the model itself, or an ordinary logic bug…
It looks like the kind of bug that appears when a calculation simply isn't precise enough. In that case, isn't the round() suspicious?

  for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
      box = [round(i, 2) for i in box.tolist()]
      x, y, x2, y2 = tuple(box)
      draw.rectangle((x, y, x2, y2), outline="red", width=1)
      draw.text((x, y), model.config.id2label[label.item()], fill="white")

Thanks for the suggestion. I also thought it looked like a rounding error, but visually it still looks the same after increasing the precision. If the rounding were applied to normalized coordinates, it would be a more likely cause, but I can't see where that could be happening.

Detected chicken with confidence 0.999 at location [906.24420166, 463.305633545, 952.193481445, 512.982299805]
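To put numbers on that: here is a minimal sketch (the 1280x720 original size is just an assumed example, not taken from the notebook) showing that rounding absolute pixel coordinates to two decimals is invisible, whereas rounding normalized coordinates before scaling back up would not be:

box_px = [906.24420166, 463.305633545, 952.193481445, 512.982299805]
print([round(v, 2) for v in box_px])            # error <= 0.005 px, invisible

orig_w, orig_h = 1280, 720                      # assumed original image size
box_norm = [box_px[0] / orig_w, box_px[1] / orig_h, box_px[2] / orig_w, box_px[3] / orig_h]
rounded = [round(v, 2) for v in box_norm]       # e.g. 0.7080 -> 0.71
print([rounded[0] * orig_w, rounded[1] * orig_h, rounded[2] * orig_w, rounded[3] * orig_h])
# now off by a few pixels, but uniformly in both directions,
# not growing toward the bottom-right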

Hmm, if those numbers are wrong, then something must have been missed during training.
The images are resized, and there's a good chance that coordinate information is lost along the way.
If the numbers are right, then the visualization code is buggy.

In any case, it is not such a serious error; it is probably just a multiplication or rounding mistake somewhere, either in training or in visualization. Your program is clearly not doing everything wrong, since the boxes are not reversed or completely disjointed, it is only a trend.

Or has your model turned into an AI that only recognizes chicken butts, because chicken butts were all it saw during training?
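To make the "coordinates get lost during resizing" idea concrete, here is a toy calculation (the 480 and 360 sizes are assumptions, not values from the notebook) showing how normalizing boxes against one canvas and rescaling them with a different reference size produces exactly this pattern, fine near (0, 0) and increasingly off toward the bottom-right:

canvas = 480        # padded square canvas the boxes were normalized against (assumed)
resized_h = 360     # unpadded image height used for rescaling by mistake (assumed)

for y_true in (10, 100, 300):
    y_norm = y_true / canvas            # normalized against the padded canvas
    y_back = y_norm * resized_h         # rescaled with the wrong reference size
    print(y_true, y_back, y_true - y_back)
# 10 -> 7.5 (off by 2.5), 100 -> 75.0 (off by 25.0), 300 -> 225.0 (off by 75.0)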

Ok, switching from Albumentations to DetrImageProcessor to resize the images fixes it. I am still confused by the following:

  1. If I follow the Object Detection guide and set max_height and max_width to MAX_SIZE, I still get inaccurate coordinates. It only works when I set height and width to MAX_SIZE, disregarding the aspect ratio. Does this suggest that there’s a bug in the padding or collate function?
  2. Since I'm setting height and width, it seems identical to what I was doing with Albumentations. Why do they yield different results? (The two size configurations are sketched right below.)
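For reference, here is a sketch of the two configurations from point 1, using the values that appear further down the thread; treat it as an illustration of the settings, not as a fix:

from transformers import AutoImageProcessor

MAX_SIZE = 480
checkpoint = "facebook/detr-resnet-50"

# Guide's setting: aspect-preserving resize, height and width capped at MAX_SIZE.
# Images in a batch then have different shapes and depend on padding (+ pixel_mask).
size_guide = {"max_height": MAX_SIZE, "max_width": MAX_SIZE}

# Workaround that produced accurate boxes: stretch every image to exactly
# MAX_SIZE x MAX_SIZE, ignoring aspect ratio, so the pad step becomes a no-op.
size_fixed = {"height": MAX_SIZE, "width": MAX_SIZE}

image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    do_resize=True,
    size=size_fixed,    # swap in size_guide to reproduce the inaccurate boxes
    do_pad=True,
    pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
)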

cc @nielsr @qubvel-hf do you have any insights?


Hi, it does indeed look like a bug, possibly related to the boxes being rescaled to the wrong image coordinates. You can take inputs.pixel_values, use its shape as the target size, and try to visualize the boxes on it.
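In code, that suggestion looks roughly like the following sketch (assuming the usual image, model, and image_processor variables from the notebook):

import torch

with torch.no_grad():
    inputs = image_processor(images=image, return_tensors="pt")
    outputs = model(**inputs)

# target size = (height, width) of the tensor the model actually saw,
# i.e. the resized + padded image, not the original photo
target_sizes = torch.tensor([inputs["pixel_values"].shape[2:]])
results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
# If these boxes line up on the processed image but not on the original,
# the bug is in mapping coordinates back to the original image size.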


Not sure if it's related, but in the guide's augment_and_transform_batch function there's this snippet. Why do we want to remove the mask? I tried removing this code, thinking that it could have caused the model to think the chickens are only in the top-left section of the image (since the bottom/right sections are always padded), but it actually made things worse: now the coordinates are even more squished.

    if not return_pixel_mask:
        result.pop("pixel_mask", None)

Squished boxes:

I updated the notebook with more visualization, showing what a collated batch with a mask looks like. It looks correct, and I can't find an issue with the inference visualization code either. Any pointers?
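For context, the collate_fn used below is roughly the one from the guide; note that pixel_mask only makes it into the batch if the transform step kept it (i.e. return_pixel_mask=True upstream):

import torch

def collate_fn(batch):
    data = {}
    data["pixel_values"] = torch.stack([x["pixel_values"] for x in batch])
    data["labels"] = [x["labels"] for x in batch]
    if "pixel_mask" in batch[0]:
        data["pixel_mask"] = torch.stack([x["pixel_mask"] for x in batch])
    return data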

Those two lines are just the body of the return_pixel_mask option, so don't delete them.
I think the resizing happens when the chicken image is passed to the preprocessor.
Here's an example

There is a way to do it manually with torch, but if you can do it with transformers, I recommend letting transformers handle it, because it is easier, especially for batch processing.
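Concretely, "letting transformers handle it" means passing the COCO annotations together with the image, so the boxes are resized and normalized in the same call as the pixels. A hedged sketch, where image, image_processor, and coco_annotations stand in for the notebook's own variables:

# coco_annotations: the usual list of COCO-style dicts with "bbox" in [x, y, w, h]
annotations = {"image_id": 0, "annotations": coco_annotations}
encoding = image_processor(images=image, annotations=annotations, return_tensors="pt")

# encoding["labels"][0]["boxes"] now holds normalized (cx, cy, w, h) boxes that
# match encoding["pixel_values"], so pixels and boxes cannot drift apart
print(encoding["pixel_values"].shape, encoding["labels"][0]["boxes"])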

Why might we want to disable the mask? DETR's documentation:

Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
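At the model level the mask is just an extra input to the forward pass. A minimal sketch, assuming model is the DetrForObjectDetection being fine-tuned and batch comes from a collate_fn that kept "pixel_mask":

outputs = model(
    pixel_values=batch["pixel_values"],
    pixel_mask=batch.get("pixel_mask"),   # None if the mask was dropped upstream
    labels=batch["labels"],
)
print(outputs.loss)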

Anyhow, the preprocessed images look correct; at least the sizes and the bounding boxes do.

Preprocessed batch without mask:

Preprocessed batch with mask:

Image processor:

from transformers import AutoImageProcessor
import albumentations
import numpy as np
import torch

IMAGE_SIZE = 480
MAX_SIZE = IMAGE_SIZE
checkpoint = "facebook/detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    do_resize=True,
    # aspect-preserving resize: height and width are capped at MAX_SIZE
    size={"max_height": MAX_SIZE, "max_width": MAX_SIZE},
    do_pad=True,
    # pad every image up to a fixed MAX_SIZE x MAX_SIZE canvas
    pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
)

# no-op augmentation; only passes the COCO-format boxes through
train_transform = albumentations.Compose(
    [
        albumentations.NoOp(),
    ],
    bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
)

Visualization of a preprocessed batch:

import numpy as np
import cv2
from PIL import Image, ImageDraw

def pixel_values_to_img(pixel_values):
  # undo normalization and convert the CHW tensor back to an HWC uint8 image
  npimg = pixel_values.numpy()
  npimg = np.transpose(npimg, (1,2,0))
  npimg = (npimg * image_processor.image_std + image_processor.image_mean) * 255
  npimg = npimg.astype(np.uint8)
  return npimg

def pixel_mask_to_img(pixel_mask):
  # turn a (1, H, W) pixel mask into a 3-channel black/white uint8 image
  npimg = pixel_mask.numpy()
  npimg = npimg[0]
  rgb = np.zeros((npimg.shape[0], npimg.shape[1], 3), dtype=np.uint8)
  rgb[:,:,0] = npimg * 255
  rgb[:,:,1] = npimg * 255
  rgb[:,:,2] = npimg * 255
  return rgb

# translate the preprocessed image batch into a PIL image
transformed_img = collate_fn([train_dataset_transformed[0]])
pixel_values = transformed_img['pixel_values']
if len(pixel_values.shape) > 3:
  pixel_values = pixel_values[0]
img=pixel_values_to_img(pixel_values)
# note: numpy image arrays are (height, width, channels)
img_h, img_w = img.shape[0], img.shape[1]

# if there's a mask, add a magenta tint to the valid, non-padded pixels
if 'pixel_mask' in transformed_img:
  mask=pixel_mask_to_img(transformed_img['pixel_mask'])
  masked_image = img.copy()
  masked_image = np.where(mask.astype(int),
                          np.array([255,0,255], dtype='uint8'),
                          masked_image)

  masked_image = masked_image.astype(np.uint8)
  masked_image = cv2.addWeighted(img, 0.6, masked_image, 0.4, 0)
  img = masked_image

img = Image.fromarray(img)

# draw bboxes (stored as normalized center_x, center_y, width, height)
draw = ImageDraw.Draw(img)
for box in transformed_img['labels'][0]['boxes']:
  cx, cy, w, h = img_w * box[0], img_h * box[1], img_w * box[2], img_h * box[3]
  x, y = cx - w / 2, cy - h / 2
  x2, y2 = x + w, y + h
  draw.rectangle((x, y, x2, y2), outline="white", width=1)

img

If you're stuck on a bug you can't pin down, try it with different data that you know should work; a square image would be even better.
Once you know there is a correct answer, you can track down the bug by process of elimination. If there is no correct answer at all, it would be a bug in the library, but that is not likely this time.
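As a concrete version of that, here is a sketch that fabricates one square image with a single known box and pushes it through the image_processor configured above, so the correct normalized output is known in advance (the sizes and the single fake object are made up for the test):

import numpy as np

H = W = 640
image = np.zeros((H, W, 3), dtype=np.uint8)
image[400:500, 480:600] = 255                 # bright block = the fake "object"

annotations = {
    "image_id": 0,
    "annotations": [{
        "image_id": 0, "category_id": 0, "iscrowd": 0,
        "area": 120 * 100, "bbox": [480, 400, 120, 100],   # COCO [x, y, w, h]
    }],
}
enc = image_processor(images=image, annotations=annotations, return_tensors="pt")

# expected normalized (cx, cy, w, h): (540/640, 450/640, 120/640, 100/640)
print(enc["labels"][0]["boxes"])              # should be ~[0.8438, 0.7031, 0.1875, 0.1562]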

Anyway, if you want to disable the mask, change it here rather than inside the function. Whether you actually do so is beside the point.

from functools import partial

transform_batch = partial(
    # augment_and_transform_batch, transform=train_transform, image_processor=image_processor, return_pixel_mask=True
    augment_and_transform_batch, transform=train_transform, image_processor=image_processor, return_pixel_mask=False
)

Does this suggest that there’s a bug in the padding or collate function?

I think the behavior of the resizing process is just stubborn, rather than an outright bug.

There are various functions for various models in transformers, but most of them are shared through class inheritance, so in essence they are reused across models.
If a problem occurs in one model, it can occur in any model; and if it is a bug, fixing it once will fix it everywhere.