Hi everyone,
I’m trying to build a proof of concept for a visual annotation tool. Say I have a collection of images (think paintings) and I want to find every occurrence of a visual detail (e.g. the signature of a particular artist) in the dataset. I don’t need to localize it in the pictures; I just want to retrieve a list of images containing a similar detail.
I treated this as an image similarity task. With a very naive approach, I ran each image in my dataset through the ImageFeatureExtractionPipeline and used the entire resulting tensor (which I guess contains all the hidden states?) as that image’s embedding.
I saved those embeddings to the filesystem, then cropped a detail from one of the images to use as a test query: I computed its embedding the same way and calculated the cosine similarity against each stored entry.
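Roughly, the comparison step I mean looks like this (a minimal sketch: the random vectors and filenames below are synthetic stand-ins for my saved embeddings, not my actual data):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-in for the embeddings previously saved to disk:
# 5 images, each flattened to a single 768-dim vector.
dataset_embeddings = {f"image_{i}.jpg": rng.normal(size=768) for i in range(5)}

# Query embedding, computed the same way from the cropped detail.
query = rng.normal(size=768)

# Rank every stored image by similarity to the query.
ranked = sorted(
    ((name, cosine_similarity(query, emb))
     for name, emb in dataset_embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```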
Results vary a lot depending on the input and are generally not that satisfying. The model that achieved the best results was vit-base-patch16-224-in21k; I also tried DINOv2-base, but it seems to perform worse.
Is there any other approach I could consider? Is a CNN like ResNet better suited to this kind of task? Should I focus on some other model?
Any suggestion is welcome,
thanks!