[Owlv2 - image_guided_detection - embed_image_query] Why choose the least similar box from the selected ones?

I’m trying to understand the owlv2 image_guided_detection and have a question.

In this tutorial about OWLv2 (zero_oneshot_owlv2_ObjectionDetection), the author says that the image_guided_detection part uses a heuristic to find the patch in the source image that most likely contains an object.

Looking at the source code at https://github.com/huggingface/transformers/blob/main/src/transformers/models/owlv2/modeling_owlv2.py

I believe the heuristic he mentioned is this part:

            # Adaptive threshold: keep every predicted box whose IoU with the
            # query box is within 80% of the best IoU
            iou_threshold = torch.max(ious) * 0.8

            selected_inds = (ious[0] >= iou_threshold).nonzero()
            if selected_inds.numel():
                selected_embeddings = class_embeds[i][selected_inds.squeeze(1)]
                # Mean over ALL patch embeddings of this query image
                mean_embeds = torch.mean(class_embeds[i], axis=0)
                mean_sim = torch.einsum("d,id->i", mean_embeds, selected_embeddings)
                # Keep the selected box whose embedding is LEAST similar to the mean
                best_box_ind = selected_inds[torch.argmin(mean_sim)]
                best_class_embeds.append(class_embeds[i][best_box_ind])
                best_box_indices.append(best_box_ind)

So what I understand from this code:

  1. Select a list of candidate boxes (those whose IoU with the query box is within 80% of the best IoU)
  2. Calculate the mean of all the box embeddings
  3. Calculate the similarity between this mean embedding and the selected box embeddings
  4. Select the box that is least similar to the mean via best_box_ind = selected_inds[torch.argmin(mean_sim)] (see the sketch below)
So, why choose the least similar box here instead of the most similar one with argmax? Don't we want the box closest to the mean?

Thanks

1 Like

[Update]

Maybe the reason for choosing the least similar one is to reduce noise, because when I change argmin to argmax I get a lot of false positives (even though the chosen bounding box is not very different in the two cases, which is very weird :thinking:).

I'm still not sure about the best way to work with OWLv2 for image-guided detection. Does anyone know the best practices?

Thanks

1 Like

The reason can be found in the original implementation of OWLv2 from scenic:

# Due to the DETR style bipartite matching loss, only one embedding
# feature for each object is "good" and the rest are "background." To find
# the one "good" feature we use the heuristic that it should be dissimilar
# to the mean embedding.

Does it also mean that OWLv2 image-guided detection is very sensitive to noise? Just a very small difference in the query bounding box and the result is completely wrong.

2 Likes

This seems to be the case here.
I have been trying to make this work for my project, and it performs worse when using the image_guided_detection method of the original class.
Did you happen to find a solution to make this work?

1 Like

It's been a while since I worked with OWLv2, so I don't remember everything in detail. In the end I made it work, but please double-check my comment here :smiley:

The HF OWL code runs a heuristic to find the "good" feature that represents the object. Due to the DETR-style bipartite matching loss, even for two bounding boxes with high IoU, one can represent the background while the other represents the object. If we choose the wrong feature, we might end up detecting the background (the image in my old comment above).

But that heuristic is for OWL-ViT (v1), not v2. The HF repo uses the same logic as v1, but it is not optimal for OWLv2. OWLv2 has an objectness score, and we can use it directly to get the best feature instead of relying on the v1 heuristic. This was confirmed by Google in an issue I asked before: https://github.com/google-research/scenic/issues/989

So, what I remember is: run OWLv2 on the reference image, extract the feature with the highest objectness score, and then use this feature for your image-guided detection. Also, be careful to double-check the bounding box of the reference object; your reference image may contain several possible objects. A rough sketch of this is below.
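For reference, here is a minimal sketch of that idea. It is not the official API: it calls internal helpers of transformers' modeling_owlv2.py (image_embedder, class_predictor, objectness_predictor, box_predictor), whose signatures may change between library versions, and the image paths are placeholders:

    import torch
    from PIL import Image
    from transformers import Owlv2Processor, Owlv2ForObjectDetection

    processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble").eval()

    query_inputs = processor(images=Image.open("reference.jpg"), return_tensors="pt")
    target_inputs = processor(images=Image.open("target.jpg"), return_tensors="pt")

    with torch.no_grad():
        # Feature map of the reference (query) image: (1, n, n, hidden)
        query_map = model.image_embedder(pixel_values=query_inputs["pixel_values"])[0]
        b, h, w, d = query_map.shape
        query_feats = query_map.reshape(b, h * w, d)

        # Per-patch class embeddings and objectness scores of the reference image
        _, query_class_embeds = model.class_predictor(query_feats)   # (1, h*w, embed_dim)
        objectness = model.objectness_predictor(query_feats)         # (1, h*w)

        # Pick the embedding with the highest objectness instead of the
        # "least similar to the mean" heuristic
        best_idx = objectness[0].argmax()
        query_embeds = query_class_embeds[0, best_idx][None, None, :]  # (1, 1, embed_dim)

        # Score every patch of the target image against this query embedding
        target_map = model.image_embedder(pixel_values=target_inputs["pixel_values"])[0]
        tb, th, tw, td = target_map.shape
        target_feats = target_map.reshape(tb, th * tw, td)

        pred_logits, _ = model.class_predictor(target_feats, query_embeds)  # (1, th*tw, 1)
        pred_boxes = model.box_predictor(target_feats, target_map)          # (1, th*tw, 4), cxcywh
        scores = torch.sigmoid(pred_logits)[0, :, 0]
        # Thresholding, NMS, and rescaling boxes to the original image are left out here.

This mirrors what image_guided_detection does, except that the query embedding is chosen by objectness rather than by the mean-dissimilarity heuristic.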

Hope it helps

2 Likes

I will give it a try and modify the class for my workflow. I know I am going to run into issues, but I'll give it a try.
This clears up a lot of things, and it seems like I won't have to choose the query embedding each time; I can just use argmax to pick the one with the highest objectness score.
If only there were a way to annotate the target image myself and use the annotated part as a query to make the detections (a rough sketch of that idea is below).
However, the given method also works.
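In case it helps, here is a rough, hedged sketch of that idea, assuming you already have the per-patch class embeddings and predicted boxes of the image you annotated (computed as in the sketch above, with the boxes from model.box_predictor). The helper name and box format are my own assumptions; note also that the Owlv2 processor pads images to a square, so the normalized xyxy annotation should be expressed in that padded frame:

    import torch
    from torchvision.ops import box_iou
    from transformers.image_transforms import center_to_corners_format

    def query_embedding_from_annotation(class_embeds, pred_boxes, annotated_box_xyxy):
        # class_embeds: (1, num_patches, embed_dim); pred_boxes: (1, num_patches, 4) in cxcywh
        boxes_xyxy = center_to_corners_format(pred_boxes)[0]   # (num_patches, 4)
        ious = box_iou(annotated_box_xyxy, boxes_xyxy)         # (1, num_patches)
        best_idx = ious[0].argmax()
        # Use the patch whose predicted box best overlaps the manual annotation
        return class_embeds[0, best_idx][None, None, :]        # (1, 1, embed_dim)

    # Example annotation (normalized xyxy in the padded image frame)
    my_box = torch.tensor([[0.30, 0.40, 0.55, 0.70]])
    # query_embeds = query_embedding_from_annotation(query_class_embeds, query_pred_boxes, my_box)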
Thanks for taking the time to reply :blush:

2 Likes