I’m trying to understand OWLv2’s image_guided_detection and have a question.
In this tutorial on zero/one-shot object detection with OWLv2, the author says that the image_guided_detection part uses a heuristic to find the patch in the source image that most likely contains an object.
Maybe the reason for choosing the least similar embedding is to reduce noise: when I change from argmin to argmax, I get a lot of false positives (even though the chosen bounding box is not very different in both cases, which is very weird).
The reason can be found in the original implementation of OWLv2 from scenic:
# Due to the DETR style bipartite matching loss, only one embedding
# feature for each object is "good" and the rest are "background." To find
# the one "good" feature we use the heuristic that it should be dissimilar
# to the mean embedding.
Does this also mean that OWLv2 image-guided detection is very sensitive to noise? Just a very small difference in the query bounding box and the result is completely wrong.
This seems to be the case here.
I have been trying to make this work for my project, and it performs worse when using the image_guided_detection method of the original class.
Did you happen to find a solution to make this work?
It’s been a while since I worked with OWLv2, so I don’t remember everything in detail. In the end I made it work, but please double-check my comments here.
The HF OWL code runs a heuristic to find the “good” feature that represents the object. Due to the DETR-style bipartite matching loss, even for two bounding boxes with high IoU, one can represent the background while the other represents the object. If we choose the wrong feature, we might end up detecting the background (see the image in my old comment above).
But this heuristic was designed for OWL-ViT v1, not v2. The HF repo reuses the v1 logic, which is not optimal for OWLv2. OWLv2 has an objectness score, and we can use it directly to pick the best feature instead of relying on the v1 heuristic. This was confirmed by Google in an issue I asked about before: https://github.com/google-research/scenic/issues/989
So, what I remember is: run OWLv2 on the reference image, extract the feature with the highest objectness score, and then use that feature for your image-guided detection. Also, be careful to double-check the bounding box of the reference object; your reference image may contain several possible objects.
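In case it helps, here is roughly how I would wire that up using the internal helpers of Hugging Face’s Owlv2ForObjectDetection (image_embedder, objectness_predictor, class_predictor, box_predictor). Treat it as a sketch and verify the names and tensor shapes against the current transformers implementation:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble").eval()

query_image = Image.open("reference.jpg")   # image containing the example object
target_image = Image.open("target.jpg")     # image to run detection on

query_pixels = processor(images=query_image, return_tensors="pt").pixel_values
target_pixels = processor(images=target_image, return_tensors="pt").pixel_values

with torch.no_grad():
    # 1) Embed the reference image and flatten the patch grid
    query_map = model.image_embedder(pixel_values=query_pixels)[0]     # (1, h, w, d)
    b, h, w, d = query_map.shape
    query_feats = query_map.reshape(b, h * w, d)

    # 2) Use the OWLv2 objectness head instead of the v1 "dissimilar to mean" heuristic
    objectness = model.objectness_predictor(query_feats)               # (1, h*w)
    best_idx = objectness.argmax(dim=-1)

    # 3) Take the class embedding of the most object-like patch as the query embedding
    _, query_class_embeds = model.class_predictor(query_feats)         # (1, h*w, d)
    query_embed = query_class_embeds[torch.arange(b), best_idx]        # (1, d)

    # 4) Score every patch of the target image against that single query embedding
    target_map = model.image_embedder(pixel_values=target_pixels)[0]
    tb, th, tw, td = target_map.shape
    target_feats = target_map.reshape(tb, th * tw, td)

    logits, _ = model.class_predictor(target_feats, query_embeds=query_embed[:, None, :])
    boxes = model.box_predictor(target_feats, target_map)              # normalized cxcywh

scores = torch.sigmoid(logits[..., 0])
keep = scores > 0.7                          # threshold to tune for your data
print(boxes[0][keep[0]], scores[0][keep[0]])
```

The only real change versus the stock image_guided_detection is step 2: the query patch is chosen with the objectness head rather than the “least similar to the mean” heuristic.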
I will give it a try and modify the class for my workflow. I know I am going to run into issues, but I’ll give it a try.
This clears up a lot of things, and it seems like I won’t have to choose the query embedding manually each time; I can just use argmax to pick the one with the highest objectness score.
If only there were a way to annotate the image myself and use the annotated part as the query to make the detections.
However, the given method also works.
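Just to sketch what I had in mind with the annotation idea, in case it is useful: one way I can imagine doing it (a hypothetical helper, not something in the HF API) is to predict per-patch boxes on the reference image, take the patch whose predicted box has the highest IoU with a hand-drawn box, and use its class embedding as the query. Note that OWLv2 boxes are normalized to the padded square image, so the annotated box would have to be in the same coordinates:

```python
import torch
from torchvision.ops import box_iou
from transformers.image_transforms import center_to_corners_format

def query_embed_from_box(model, query_pixel_values, box_xyxy_norm):
    # box_xyxy_norm: tensor [x1, y1, x2, y2], normalized to [0, 1] on the
    # padded/resized reference image (hypothetical helper, double-check coords).
    with torch.no_grad():
        feature_map = model.image_embedder(pixel_values=query_pixel_values)[0]
        b, h, w, d = feature_map.shape
        feats = feature_map.reshape(b, h * w, d)

        # Per-patch predicted boxes (normalized cxcywh) converted to corners
        pred_boxes = center_to_corners_format(model.box_predictor(feats, feature_map))[0]

        # Patch whose predicted box overlaps the annotated box the most
        ious = box_iou(box_xyxy_norm.unsqueeze(0), pred_boxes)[0]
        best_idx = ious.argmax()

        _, class_embeds = model.class_predictor(feats)
        return class_embeds[0, best_idx]  # use as the query embedding for the target image
```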
Thanks for taking the time to reply.