I'm trying to understand OWLv2's `image_guided_detection` and have a question.

From a tutorial on zero-/one-shot object detection with OWLv2, the author says that the `image_guided_detection` part uses a heuristic to `get the patch in the source image which most likely contains an object`.

Looking at the source code at https://github.com/huggingface/transformers/blob/main/src/transformers/models/owlv2/modeling_owlv2.py

I believe the heuristic he mentions is this part:

```python
iou_threshold = torch.max(ious) * 0.8
selected_inds = (ious[0] >= iou_threshold).nonzero()
if selected_inds.numel():
    selected_embeddings = class_embeds[i][selected_inds.squeeze(1)]
    mean_embeds = torch.mean(class_embeds[i], axis=0)
    mean_sim = torch.einsum("d,id->i", mean_embeds, selected_embeddings)
    best_box_ind = selected_inds[torch.argmin(mean_sim)]
    best_class_embeds.append(class_embeds[i][best_box_ind])
    best_box_indices.append(best_box_ind)
```

So what I understand from this code:

- Select the bboxes whose IoU with the query box is at least 80% of the maximum IoU
- Compute the mean of all patch embeddings (`mean_embeds`)
- Compute the similarity between `mean_embeds` and each selected bbox embedding
- Pick the selected bbox that is *least* similar to the mean via
`best_box_ind = selected_inds[torch.argmin(mean_sim)]`

So why choose the least similar box here instead of the most similar one with `argmax`? Don't we want the box closest to the mean?
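For concreteness, here is a self-contained toy version of that snippet (made-up embeddings and IoUs, no batch loop over `i`), which at least confirms my reading that `argmin` keeps the selected box whose embedding is least similar to the mean:

```python
import torch

torch.manual_seed(0)
class_embeds = torch.randn(4, 3)                  # 4 candidate patch embeddings (dim 3)
ious = torch.tensor([[0.90, 0.80, 0.75, 0.30]])   # made-up IoUs vs. the query box

iou_threshold = torch.max(ious) * 0.8             # keep boxes within 80% of the best IoU
selected_inds = (ious[0] >= iou_threshold).nonzero()   # indices 0, 1, 2 survive here

selected_embeddings = class_embeds[selected_inds.squeeze(1)]
mean_embeds = torch.mean(class_embeds, axis=0)    # mean over ALL patch embeddings
mean_sim = torch.einsum("d,id->i", mean_embeds, selected_embeddings)

# The line in question: argmin picks the selected box whose embedding is
# LEAST similar to the mean; argmax would pick the most similar one.
best_box_ind = selected_inds[torch.argmin(mean_sim)]
print(best_box_ind.item())
```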

Thanks