Owl-v2 bounding box misalignment problem

I’m running Owl-v2 on a dataset of 30k images and 75 classes.

The same dataset ran well with the previous OWL-ViT version. With the new version, I’m seeing misaligned bounding boxes. I attach an example: OWL-ViT (v1) on the left, OWLv2 on the right. The same problem occurs across thousands of images. (See the boat and cloud bounding boxes at the top right, for example.)

This obviously has a large impact on the quality of the results. Very strange. Has anyone else seen the same problem? Thanks.

Hi,

Make sure to visualize results on the padded image, rather than the original one. See the issue "processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble") may have some problem" (Issue #27205 in huggingface/transformers on GitHub) for details, or my demo notebook.

Hi nielsr,

I need to draw and export bounding box coordinates relative to the original image, not the padded one. Is that possible with OWLv2? I could do it with the previous version. Has something changed?

Not really. The Google authors also visualize the bounding boxes on the padded image in their original Colab notebook.


This works for OWLv1 because the image processor in the Transformers library was slightly different from the one in the Scenic (JAX-based) repository: OwlViTImageProcessor applies center cropping, which the original implementation does not. This was corrected for OWLv2 (with Owlv2ImageProcessor), which applies the same image preprocessing as the original implementation, including padding the image.
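To make the effect of the padding concrete, here is a minimal sketch of the geometry. It assumes the image is padded at the bottom/right to a square whose side is the longer image side, and that the model's boxes are normalized to that square (both are assumptions consistent with the crop used later in this thread, not something checked against the library here). The function names are hypothetical.

```python
def normalized_box_to_original(box, width, height):
    """Map a normalized (xmin, ymin, xmax, ymax) box on the padded square
    to pixel coordinates on the original width x height image.

    Assumption: the original image sits in the top-left corner of the square.
    """
    side = max(width, height)  # side of the padded square, in original pixels
    return tuple(coord * side for coord in box)


def naive_scale(box, width, height):
    """Scale x by width and y by height, which ignores the padding."""
    xmin, ymin, xmax, ymax = box
    return (xmin * width, ymin * height, xmax * width, ymax * height)


box = (0.25, 0.25, 0.5, 0.5)
print(normalized_box_to_original(box, 640, 480))  # (160.0, 160.0, 320.0, 320.0)
print(naive_scale(box, 640, 480))                 # (160.0, 120.0, 320.0, 240.0)
```

On a landscape image the naive version squeezes the y coordinates, which matches the kind of misalignment described at the top of the thread.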

Hi nielsr, Thanks.

So now the question is: how can I draw bounding boxes on the original (non-square) image? Do I have to add the padding to the box coordinates?

I need bounding boxes on the original images (not padded and not squared).

Where can I retrieve the padding that was used?

Hi gfatigati,

You can do this easily by cropping the padded image (with the bounding boxes drawn on it) and resizing it to the original size. However, if the original image’s longer side was more than 960px, it will be upsampled by the resize, so you could skip the resize and keep the cropped image, which has the original aspect ratio.

Follow the updated visualization instructions after nielsr’s PR:
https://huggingface.co/docs/transformers/model_doc/owlv2#transformers.Owlv2ForObjectDetection.forward.example but save the original image’s dimensions after opening it. Then use those values to crop and resize the visualization image.

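# Save original dimensions here
w, h = image.size
# Crop away the padding (the image sits in the top-left corner), then resize.
# Using max(w, h) also handles portrait images, where h > w.
scale = unnormalized_image.width / max(w, h)
unnormalized_image = unnormalized_image.crop((0, 0, round(w * scale), round(h * scale)))
unnormalized_image = unnormalized_image.resize((w, h))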

Full example:

from PIL import Image, ImageDraw
import numpy as np
import torch
from transformers import AutoProcessor, Owlv2ForObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

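# img_file (path to an image) and texts (a flat list of query strings)
# are assumed to be defined before this point
image = Image.open(img_file)
w, h = image.size  # Save the original dimensions here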

inputs = processor(text=texts, images=image, return_tensors="pt")
# forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Note: boxes need to be visualized on the padded, unnormalized image
# hence we'll set the target image sizes (height, width) based on that

def get_preprocessed_image(pixel_values):
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    unnormalized_image = Image.fromarray(unnormalized_image)
    return unnormalized_image

unnormalized_image = get_preprocessed_image(inputs.pixel_values)

target_sizes = torch.Tensor([unnormalized_image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to final bounding boxes and scores
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)

i = 0  # Retrieve predictions for the first image
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

# Draw on the (padded) image
draw = ImageDraw.Draw(unnormalized_image)
for box, score, label in zip(boxes, scores, labels):
    # Convert the tensor to plain floats before passing it to PIL
    xmin, ymin, xmax, ymax = box.tolist()
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{texts[label]}: {round(score.item(), 2)}", fill="white")

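# Crop the padded image to its original aspect ratio
# (max(w, h) also handles portrait images, where h > w)
scale = unnormalized_image.width / max(w, h)
unnormalized_image = unnormalized_image.crop((0, 0, round(w * scale), round(h * scale)))
# Resize it to the original size
unnormalized_image = unnormalized_image.resize((w, h))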
unnormalized_image.show()

Hi lhovon,

From what I understand, you draw the boxes on the resized and padded image, and then restore the image, and with it the boxes, to the original size. This is not really what I need; maybe I didn’t explain myself well. I need the bounding box coordinates relative to the original image so I can store them in an XML file, not just draw them.

But maybe I solved it simply by using:

# W, H: width and height of the original image
target_size = torch.Tensor([[max(W, H), max(W, H)]]).to(device)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_size, threshold=score_threshold)

according to:

https://github.com/huggingface/transformers/issues/27205

The retrieved bounding boxes line up with the original images, with no need to resize anything. I tried it on thousands of images, and it seems to work well.
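For the XML export mentioned above, here is a rough sketch. The XML layout is a hypothetical VOC-style one (the thread does not say which schema is used), and the clipping step is an assumption: since boxes are predicted on the padded square, a box could in principle extend past the edge of the original image. Both function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def clip_box(box, width, height):
    """Clamp (xmin, ymin, xmax, ymax) to the original image bounds."""
    xmin, ymin, xmax, ymax = box
    return (
        min(max(xmin, 0.0), width),
        min(max(ymin, 0.0), height),
        min(max(xmax, 0.0), width),
        min(max(ymax, 0.0), height),
    )


def boxes_to_xml(filename, width, height, detections):
    """Build a VOC-style annotation string (hypothetical layout).

    detections: iterable of (label, score, (xmin, ymin, xmax, ymax)),
    with boxes already in original-image pixel coordinates.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for label, score, box in detections:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        ET.SubElement(obj, "score").text = f"{score:.3f}"
        bndbox = ET.SubElement(obj, "bndbox")
        clipped = clip_box(box, width, height)
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), clipped):
            ET.SubElement(bndbox, tag).text = str(round(value))
    return ET.tostring(root, encoding="unicode")
```

With the results from above, each detection would be passed as something like (texts[label], score.item(), box.tolist()).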

FWIW - it seems to always be off by the ratio of width to height. The following code worked for me after running inference on the original image without adding any padding:

##############################################################
# OWLv2 image post processing seems to have a bug whereby, if
# the image is not square, the coordinates in the lesser
# dimension are off by a ratio of the lesser dimension to the
# greater dimension. This is a workaround to get the correct
# coordinates.
##############################################################
width_ratio = 1
height_ratio = 1
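width = original_width    # placeholder: width of the original image in pixels
height = original_height  # placeholder: height of the original image in pixels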
if width > height:
    height_ratio = height / width
elif height > width:
    width_ratio = width / height

x1 = detection.bbox.x1 / width_ratio
y1 = detection.bbox.y1 / height_ratio
x2 = detection.bbox.x2 / width_ratio
y2 = detection.bbox.y2 / height_ratio
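The same workaround can be wrapped in a small helper that handles a list of boxes. This is only a sketch: correct_boxes is a hypothetical name, and it assumes the boxes were scaled with the original (height, width) as target size rather than with the padded square.

```python
def correct_boxes(boxes, width, height):
    """Apply the ratio workaround to (x1, y1, x2, y2) boxes that were
    scaled with target_sizes=(height, width) instead of the padded square."""
    width_ratio = width / height if height > width else 1.0
    height_ratio = height / width if width > height else 1.0
    return [
        (x1 / width_ratio, y1 / height_ratio, x2 / width_ratio, y2 / height_ratio)
        for x1, y1, x2, y2 in boxes
    ]


# Landscape 640x480 image: only the y coordinates change.
print(correct_boxes([(160.0, 120.0, 320.0, 240.0)], 640, 480))
# [(160.0, 160.0, 320.0, 320.0)]
```

The result is the same as scaling the normalized boxes by max(width, height) on both axes, which is what the target_size approach earlier in the thread does.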