Caching image prototype embeddings for image-guided object detection using OWL-ViT

kubic71 · February 20, 2024, 1:31pm

The OWL-ViT model currently supports image-guided one-shot object detection by feeding reference image embeddings, instead of text embeddings, into the classification head. This is implemented by the image_guided_detection method.

There are two problems:

it doesn’t support passing multiple reference images as input
the reference image is passed through the image encoder on every call

In practice, I’d like to use the model’s image_guided_detection for inference on a larger dataset. Recomputing the reference image query embedding for every target image is clearly wasteful, since the query embeddings do not depend on the target image.

Is there a way to cache the query image embeddings?
And is there a way to use multiple query images for one target image?
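
To make the caching question concrete, here is a minimal sketch of what I have in mind. It is library-agnostic: `QueryEmbeddingCache`, `embed_fn`, and `fake_encoder` are hypothetical names I made up for illustration, and `fake_encoder` only stands in for "run the image encoder on the reference image and extract its query embedding". It is not the transformers API.

```python
from typing import Callable, Dict, Hashable, List

Embedding = List[float]


class QueryEmbeddingCache:
    """Memoize reference-image embeddings so the encoder runs once per image."""

    def __init__(self, embed_fn: Callable[[Hashable], Embedding]) -> None:
        # embed_fn is a hypothetical stand-in for the expensive encoder pass.
        self._embed_fn = embed_fn
        self._store: Dict[Hashable, Embedding] = {}
        self.encoder_calls = 0  # counts how often the encoder actually ran

    def get(self, key: Hashable) -> Embedding:
        # Only call the encoder the first time a reference image is seen.
        if key not in self._store:
            self.encoder_calls += 1
            self._store[key] = self._embed_fn(key)
        return self._store[key]


def fake_encoder(path: str) -> Embedding:
    # Placeholder: a real implementation would load the image and encode it.
    return [float(len(path)), 1.0]


cache = QueryEmbeddingCache(fake_encoder)
targets = ["t1.jpg", "t2.jpg", "t3.jpg"]
for target in targets:
    # The same reference embedding is reused for every target image.
    query = cache.get("reference_cat.jpg")
```

With something like this, the reference image would be encoded once no matter how many target images are processed. What is missing is a supported way to hand the cached embedding to the detection head.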

Motivation

One-shot learning is an extreme case of few-shot learning, and in practice it’s usually very hard, if not impossible, to represent a whole class with only one reference image.

Therefore, a natural extension is to use multiple prototypical images that capture the detected object in various situations, lighting conditions, etc.
But as of now, the running time of OWL-ViT scales linearly with the number of query images, which makes it impractical for real-world usage.
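
As a rough illustration of the multi-prototype idea: if the target image's per-patch embeddings are computed once and the query embeddings are already cached, each extra prototype only costs one more similarity per patch. The sketch below assumes plain cosine similarity and a max over prototypes. Both are my assumptions, and the model's real classification head may apply additional learned scaling that is left out here. `cosine` and `score_patches` are hypothetical helper names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_patches(patch_embeds: Sequence[Vector],
                  query_embeds: Sequence[Vector]) -> List[float]:
    """Score each target patch by its best match over all prototypes."""
    if not query_embeds:
        raise ValueError("at least one query embedding is required")
    return [max(cosine(p, q) for q in query_embeds) for p in patch_embeds]


# Toy data: two cached prototype embeddings, three target patch embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = score_patches(patches, queries)
```

Taking the mean of the prototype embeddings and scoring against that single vector would be an even cheaper alternative, though it may blur prototypes that look very different from each other.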

taher30 · April 11, 2025, 7:59pm

Did you happen to get a solution or alternative for this? I am trying to do something similar.

