Caching image prototype embeddings for image-guided object detection using OWL-ViT

The OWL-ViT model currently supports image-guided one-shot object detection by using a reference image embedding, instead of a text embedding, as the input to the classification head. This is implemented in the image_guided_detection method.

There are two problems:

  • it doesn’t support passing multiple reference images as input
  • the reference image is re-encoded by the image encoder on every call

In practice I’d like to use the model’s image_guided_detection for inference on a larger dataset, and recomputing the reference-image query embedding for every target image is clearly wasteful, since the query embeddings do not depend on the target image.

  1. Is there a way to cache the query image embeddings?
  2. And is there a way to use multiple query images for a single target image?


In practice, one-shot learning is an extreme case of few-shot learning, and it is usually very hard, if not impossible, to represent a whole class with a single reference image.

Therefore a natural extension is to use multiple prototype images capturing the target object in various situations, lighting conditions, etc.
But as of now, the running time of OWL-ViT scales linearly with the number of query images, which makes it impractical for real-world use.
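One plausible way to avoid the linear scaling, sketched below under the assumption that prototype embeddings can be meaningfully averaged, is to aggregate the per-reference embeddings into a single class prototype (here an element-wise mean) so the classification head sees one query vector regardless of how many reference images were used. The names and vectors are illustrative, not the OWL-ViT API.

```python
def mean_embedding(embeddings):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

# Illustrative prototype embeddings for one class, e.g. the same
# object captured under different lighting conditions.
prototypes = [
    [1.0, 0.0, 2.0],  # reference image 1
    [3.0, 2.0, 0.0],  # reference image 2
]

class_prototype = mean_embedding(prototypes)
print(class_prototype)  # [2.0, 1.0, 1.0]
```

Whether a simple mean is the right aggregation (versus, say, keeping all prototypes and taking the maximum similarity per box) is an open design question, but either way the per-target cost stays constant in the number of reference images once their embeddings are cached.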