[Owlv2 - image_guided_detection - embed_image_query] Why choose the least similar box among the selected ones?

I’m trying to understand OWLv2’s image_guided_detection and have a question.

In this tutorial about OWLv2 (zero_oneshot_owlv2_ObjectionDetection), the author says that the image_guided_detection part uses a heuristic to find the patch in the source image that most likely contains an object.

Looking at the source code at https://github.com/huggingface/transformers/blob/main/src/transformers/models/owlv2/modeling_owlv2.py, I believe the heuristic he mentions is this part of embed_image_query:

    # Keep every box whose IoU with the query box is within 80% of the best IoU
    iou_threshold = torch.max(ious) * 0.8

    selected_inds = (ious[0] >= iou_threshold).nonzero()
    if selected_inds.numel():
        selected_embeddings = class_embeds[i][selected_inds.squeeze(1)]
        # The mean is taken over ALL box embeddings of image i, not just the selected ones
        mean_embeds = torch.mean(class_embeds[i], axis=0)
        # Dot-product similarity between the mean and each selected embedding
        mean_sim = torch.einsum("d,id->i", mean_embeds, selected_embeddings)
        # Pick the selected box whose embedding is LEAST similar to the mean
        best_box_ind = selected_inds[torch.argmin(mean_sim)]
        best_class_embeds.append(class_embeds[i][best_box_ind])
        best_box_indices.append(best_box_ind)

So, what I understand from this code is:

  1. Select a list of candidate boxes whose IoU with the query box is at least 80% of the best IoU
  2. Calculate the mean of the box embeddings (note that mean_embeds is taken over all of class_embeds[i], not only the selected boxes)
  3. Calculate the similarity between this mean embedding and each selected box embedding
  4. Select the box which is the least similar to the mean, via best_box_ind = selected_inds[torch.argmin(mean_sim)] (a toy reproduction is sketched below)
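
For illustration, here is a minimal toy reproduction of those steps (the IoU values, number of boxes, and embedding size are made up):

    import torch

    # Toy setup: 5 candidate boxes, 4-dim embeddings (real OWLv2 uses many
    # more boxes and a much larger embedding dimension)
    torch.manual_seed(0)
    ious = torch.tensor([[0.20, 0.85, 0.90, 0.88, 0.10]])  # IoU of each box with the query box
    class_embeds_i = torch.randn(5, 4)                     # one embedding per box

    iou_threshold = torch.max(ious) * 0.8                  # 0.72
    selected_inds = (ious[0] >= iou_threshold).nonzero()   # boxes 1, 2, 3

    selected_embeddings = class_embeds_i[selected_inds.squeeze(1)]
    mean_embeds = torch.mean(class_embeds_i, axis=0)       # mean over ALL 5 boxes
    mean_sim = torch.einsum("d,id->i", mean_embeds, selected_embeddings)
    best_box_ind = selected_inds[torch.argmin(mean_sim)]   # least similar to the mean
    print(best_box_ind)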

So, why choose the least similar box here instead of the most similar one with argmax? Intuitively, we would want the box closest to the mean, right?

Thanks

[Update]

Maybe the reason for choosing the least similar embedding is to remove noise, because when I change argmin to argmax I get a lot of false positives (even though the chosen bounding box is not very different in the two cases, which is very weird :thinking:).

I’m still not sure about the best way to work with OWLv2 for image-guided detection. Does anyone know the best practices?
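
For context, here is roughly how I’m invoking image_guided_detection via transformers (the checkpoint name, image paths, and thresholds are just examples):

    import torch
    from PIL import Image
    from transformers import Owlv2Processor, Owlv2ForObjectDetection

    processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

    image = Image.open("scene.jpg")        # target image to search in
    query_image = Image.open("query.jpg")  # image cropped around the example object

    inputs = processor(images=image, query_images=query_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model.image_guided_detection(**inputs)

    # Rescale boxes to the original image size and apply score/NMS thresholds
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_image_guided_detection(
        outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes
    )
    print(results[0]["boxes"], results[0]["scores"])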

Thanks

The reason can be found in the original implementation of OWLv2 from scenic:

    # Due to the DETR style bipartite matching loss, only one embedding
    # feature for each object is "good" and the rest are "background." To find
    # the one "good" feature we use the heuristic that it should be dissimilar
    # to the mean embedding.
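
A toy sketch of that intuition (purely illustrative numbers, assuming most embeddings collapse toward a shared "background" direction): the mean is dominated by the background features, so the one "good" embedding is the one least similar to the mean.

    import torch

    # 9 identical "background" box embeddings plus 1 distinct "good"
    # embedding, mimicking the effect of DETR-style bipartite matching
    background = torch.ones(9, 16)          # background boxes share a direction
    good = torch.zeros(1, 16)
    good[0, 0] = 1.0                        # the matched "good" box points elsewhere
    class_embeds = torch.cat([background, good], dim=0)

    mean_embeds = class_embeds.mean(dim=0)  # dominated by the background cluster
    mean_sim = torch.einsum("d,id->i", mean_embeds, class_embeds)
    print(mean_sim)                         # background boxes: 14.5, good box: 1.0
    print(torch.argmin(mean_sim))           # tensor(9) -> the "good" embedding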

Does this also mean that OWLv2 image-guided detection is very sensitive to noise? Even a very small difference in the query bounding box can make the result completely wrong.