[OWLv2 - image_guided_detection - embed_image_query] Why choose the least similar box from the selected ones?

I’m trying to understand OWLv2's image_guided_detection and have a question.

In this tutorial about OWLv2, zero_oneshot_owlv2_ObjectionDetection, the author says that the image_guided_detection part uses a heuristic to pick the patch in the source image that most likely contains an object.

Looking at the source code at https://github.com/huggingface/transformers/blob/main/src/transformers/models/owlv2/modeling_owlv2.py, I believe the heuristic he mentioned is here:

            iou_threshold = torch.max(ious) * 0.8

            selected_inds = (ious[0] >= iou_threshold).nonzero()
            if selected_inds.numel():
                selected_embeddings = class_embeds[i][selected_inds.squeeze(1)]
                mean_embeds = torch.mean(class_embeds[i], axis=0)
                mean_sim = torch.einsum("d,id->i", mean_embeds, selected_embeddings)
                best_box_ind = selected_inds[torch.argmin(mean_sim)]

So what I understand from this code:

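  1. Select a list of boxes (those whose IoU is at least 0.8 times the maximum IoU)
  2. Calculate the mean embedding over all boxes of the image (class_embeds[i]), not only the selected ones
  3. Calculate the similarity between this mean embedding and the embedding of each selected box
  4. Select the box that is the least similar to the mean via best_box_ind = selected_inds[torch.argmin(mean_sim)]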
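
The steps above can be sketched in plain Python (no torch needed) to make the selection explicit. This is a toy re-implementation for illustration only, not the transformers code: the function name `pick_query_index`, its `ratio` and `pick` parameters, and all the numbers are assumptions made up for this example.

```python
def pick_query_index(ious, embeds, ratio=0.8, pick=min):
    """Toy version of the selection heuristic (hypothetical helper).

    ious:   one IoU value per predicted box
    embeds: one embedding (list of floats) per predicted box
    pick:   min mirrors argmin, max mirrors argmax
    """
    threshold = max(ious) * ratio
    selected = [i for i, iou in enumerate(ious) if iou >= threshold]
    if not selected:
        return None
    dim = len(embeds[0])
    # Mean over ALL box embeddings, like torch.mean(class_embeds[i], axis=0)
    mean = [sum(e[d] for e in embeds) / len(embeds) for d in range(dim)]
    # Dot product of the mean with each selected embedding ("d,id->i")
    sims = [sum(m * x for m, x in zip(mean, embeds[i])) for i in selected]
    return selected[sims.index(pick(sims))]


# Made-up data: boxes 0 and 1 pass the IoU threshold, boxes 2-4 do not.
ious = [0.9, 0.85, 0.1, 0.0, 0.0]
embeds = [
    [1.0, 0.0],  # box 0: distinct, "object"-like embedding
    [0.1, 0.9],  # box 1: close to the background embeddings
    [0.0, 1.0],
    [0.0, 1.0],
    [0.1, 1.0],
]
least_similar = pick_query_index(ious, embeds, pick=min)  # box 0
most_similar = pick_query_index(ious, embeds, pick=max)   # box 1
```

In this toy setup both candidate boxes overlap well, but argmin and argmax return different boxes with very different embeddings, which may be why the detections change so much even when the boxes themselves look almost the same.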

So, why choose the least similar here instead of the most similar one with argmax? We want to choose a box closest to the mean, right?



Maybe the reason for choosing the least similar one is to remove noise, because when I change argmin to argmax I get a lot of false positives (even though the chosen bounding box does not differ much between the two cases, which is very weird).

I'm still not sure what the best way to work with OWLv2 for image-guided detection is. Does anyone know the best practices?


The reason can be found in the original implementation of OWLv2 from scenic:

# Due to the DETR style bipartite matching loss, only one embedding
# feature for each object is "good" and the rest are "background." To find
# the one "good" feature we use the heuristic that it should be dissimilar
# to the mean embedding.
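
One way to read this comment, as a toy sketch with made-up numbers rather than anything taken from scenic or transformers: if most embeddings are background-like, the mean embedding is dominated by the background, so the single "good" embedding ends up with the lowest similarity to the mean. The vectors and helper names below are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings: nine similar "background" vectors and one "good" one.
background = [normalize([0.1 * k, 1.0, 0.0]) for k in range(-4, 5)]
good = normalize([0.0, 0.2, 1.0])
embeds = background + [good]

# The mean is pulled towards the background direction.
mean = [sum(e[d] for e in embeds) / len(embeds) for d in range(3)]

# Similarity of every embedding to the mean; the "good" one scores lowest.
sims = [dot(mean, e) for e in embeds]
least_similar_index = sims.index(min(sims))
```

Here `least_similar_index` points at the last entry, the "good" embedding, which is the one an argmin over the similarities would pick.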

Does it also mean that OWLv2 image-guided detection is very sensitive to noise? Just a very small difference in the query bounding box makes the result completely wrong.