Avoid Re-encoding Reference Images in Vision-LLM When Comparison Criteria Are User-Defined

Hi everyone,

I’m working with a Vision-LLM (like Qwen-VL / LLaVA / llama.cpp-based multimodal models) where I need to compare new images against reference images. The key part of my use case is that users define the comparison criteria (e.g., fur length, ear shape, color patterns), and I’m using image-to-text models to evaluate how well a new image matches a reference according to these criteria.

Currently, every time I send a prompt including the reference images, the model re-encodes them from scratch. From the logs, I can see:
llama-server:
encoding image slice...
image slice encoded in 3800–4800 ms
decoding image batch ...

Even for the same reference images, this happens every single request, which makes inference slow.

Questions:

  • Has anyone dealt with user-defined comparison criteria in Vision-LLM pipelines?

  • Are there ways to cache or pre-load reference images in llama.cpp / Hugging Face pipelines to avoid repeated encoding?

  • What are recommended strategies to efficiently compare new images against a set of references using image-to-text models without reprocessing the reference images each time?

Thanks in advance for any advice or examples!


Are there ways to cache or pre-load reference images in llama.cpp / Hugging Face pipelines to avoid repeated encoding?

This is an area where behaviour can change significantly based on a single factor, such as the backend or the model repository metadata (for example, even with the same GGUF, the vision component behaves differently under llama.cpp and Ollama), so getting this right is likely to be time-consuming.
For now, VLMs leave you far more at the mercy of per-model implementation differences than plain LLMs do, so implementing your own cache can be a hassle.


The clean way to think about your problem is:

  • the reference image is stable,
  • the comparison rule is not.

So the reusable thing is not the final answer and not the raw image. It is the reference image’s post-vision representation: the output of the vision encoder and projector, plus the metadata the model needs to place those features back into the LLM. Current docs and issue threads across vLLM, Qwen, and NVIDIA’s multimodal serving stack all point to that same boundary. (vLLM)

What is happening in your current setup

Your llama-server log is showing the expensive vision path running again on every request. In a typical VLM stack, the path is:

  1. image preprocessing,
  2. vision encoder,
  3. projector or bridge into the LLM space,
  4. language-model prefill and decode.

If you send the same reference image again as raw image input, the server usually repeats steps 1 to 3 unless the runtime explicitly supports caching or injecting precomputed vision features. That is exactly the kind of repeated work vLLM’s multimodal RFC calls out: identical media being re-encoded across requests wastes encoder and projector compute, bandwidth, and scheduling capacity. (GitHub)

Direct answers

1. Has anyone dealt with user-defined comparison criteria?

Yes.

Not always under that exact phrase, but the closest public work treats the problem as instruction-aware multimodal matching rather than fixed similarity. VLM2Vec explicitly supports instruction-following multimodal embeddings for combinations of image and text, and Qwen3-VL-Embedding plus Qwen3-VL-Reranker are built for multimodal retrieval and reranking, with the embedding model handling recall and the reranker handling precise relevance scoring. That is very close to “the user defines what counts as similar this time.” (Tiger AI Lab)

2. Can you cache or preload reference images in llama.cpp or Hugging Face?

llama.cpp: partially in theory, poorly in practice today for multimodal. The server README documents slot save and restore with --slot-save-path and /slots/{id}?action=save|restore, but current open issues show that vision-enabled and --mmproj setups still have serious limits around slot persistence and cache reuse. There is an open PR adding /vision/embedding and image_embedding inputs, which is exactly the direction you want, but it is still an open PR, not stable baseline functionality. (GitHub)
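For completeness, this is roughly what the documented slot save and restore flow looks like against llama-server (text-only contexts; the endpoint shape follows the server README, but verify the payload against your build, and keep in mind the multimodal limitations discussed later):

import requests

# Minimal sketch of llama-server slot persistence (server started with --slot-save-path).
# Per the issues referenced below, this does not yet reliably cover vision-enabled slots.
BASE = "http://localhost:8080"

# Save the prompt cache of slot 0 to a file under --slot-save-path.
requests.post(f"{BASE}/slots/0?action=save", json={"filename": "reference_prompt.bin"})

# Later, restore that cache into slot 0 instead of re-prefilling the prompt.
requests.post(f"{BASE}/slots/0?action=restore", json={"filename": "reference_prompt.bin"})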

Hugging Face Transformers: yes, but usually through a model-specific wrapper, not a polished universal API for every VLM. Qwen2.5-VL exposes get_image_features(pixel_values, image_grid_thw), which is the right seam for extracting reusable visual features. But the low-level path many people try next, passing those features back through inputs_embeds, has shown regressions and shape mismatches in the issue tracker. So it is possible, but brittle if you build directly on the lowest-level generation plumbing. (Hugging Face)

vLLM: yes, much more directly. Its docs show content-hash-based cached multimodal inputs, stable multi_modal_uuids, the ability to skip resending cached media, and support for image_embeds input, including Qwen-family cases that require image_grid_thw. vLLM also has an EncoderCacheManager specifically for multimodal encoder outputs. (vLLM)

3. What is the recommended strategy?

For your case, the best strategy is:

  • encode each reference image once,
  • store its visual package,
  • let the user define criteria later,
  • reuse the cached reference features,
  • only re-encode the new candidate image.

If you have many references, add a first retrieval stage so you do not run the full generative comparison against every reference. That retrieval stage can use an instruction-aware embedding model, and the final stage can use a generative VLM or reranker for detailed, criterion-by-criterion judgment. (NVIDIA Docs)


The architecture I would use

A. Split the pipeline into fixed work and variable work

Fixed work

For each reference image, do this once:

  • preprocess with the exact model processor,
  • run the vision encoder and projector,
  • store the resulting features.

Variable work

For each user request:

  • parse the user’s rubric,
  • encode the new candidate image,
  • fetch the cached reference features,
  • run the final comparison.

This works because your rubric changes, but the reference image does not. Encoder cache is the right optimization for that pattern. NVIDIA’s Dynamo docs describe this explicitly: the embedding cache stores vision encoder outputs and reuses them when the same image appears again, and they also state that this is separate from KV cache. (NVIDIA Docs)
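A minimal sketch of that split, with hypothetical helper names (encode_reference and compare_with_rubric are placeholders for whatever your serving stack exposes):

from pathlib import Path
import hashlib
import torch

CACHE_DIR = Path("ref_feature_cache")
CACHE_DIR.mkdir(exist_ok=True)

def reference_key(image_bytes: bytes) -> str:
    # Content hash of the source image; see the cache-key discussion below
    # for the extra fields (model/processor revision, preprocessing settings).
    return hashlib.sha256(image_bytes).hexdigest()

def get_reference_features(image_path: str):
    """Fixed work: encode each reference image once and persist the result."""
    image_bytes = Path(image_path).read_bytes()
    cache_file = CACHE_DIR / f"{reference_key(image_bytes)}.pt"
    if cache_file.exists():
        return torch.load(cache_file)           # cache hit: encoder + projector skipped
    package = encode_reference(image_path)       # hypothetical: processor + vision tower + projector
    torch.save(package, cache_file)
    return package

def compare(candidate_path: str, reference_path: str, rubric: str) -> str:
    """Variable work: only the candidate image and the rubric change per request."""
    ref_features = get_reference_features(reference_path)
    return compare_with_rubric(candidate_path, ref_features, rubric)  # hypothetical VLM call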

B. Cache a visual package, not just a tensor

For Qwen-family models, the reusable object is not just image_embeds. You also need the metadata that tells the model how those features map into the LLM side. vLLM’s Qwen examples show that image_embeds for Qwen2-VL must be paired with image_grid_thw, and the Qwen2.5-VL docs describe image_grid_thw as the temporal, height, and width feature shape of each image in the LLM. (vLLM)

So I would store, per reference image:

  • image_embeds
  • image_grid_thw or equivalent model-specific metadata
  • model ID and revision
  • processor ID and revision
  • preprocessing settings
  • source image hash

The last three are an engineering recommendation, not a literal API requirement. But Qwen’s docs show that preprocessing settings such as min_pixels and max_pixels change resolution and therefore compute and feature layout, so they belong in your cache key if you want correctness. (Hugging Face)
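A sketch of what such a package could look like on disk (the field names are my own, not a standard schema):

from dataclasses import dataclass, asdict
import torch

@dataclass
class ReferenceVisualPackage:
    # Reusable vision-side features (post encoder + projector).
    image_embeds: torch.Tensor
    image_grid_thw: torch.Tensor           # Qwen-family metadata; other models differ
    # Provenance fields that make the cache safe to reuse.
    model_id: str                          # e.g. "Qwen/Qwen2.5-VL-7B-Instruct"
    model_revision: str
    processor_revision: str
    preprocessing: dict                    # e.g. {"min_pixels": ..., "max_pixels": ...}
    source_image_sha256: str

def save_package(pkg: ReferenceVisualPackage, path: str) -> None:
    torch.save(asdict(pkg), path)

def load_package(path: str) -> dict:
    return torch.load(path)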

C. Add a retrieval stage if you have many references

If you have more than a small handful of references, do not ask a generative VLM to compare the candidate image against every reference. That is the expensive path.

Use:

  1. an embedding model for coarse recall,
  2. a reranker or generative VLM for final scoring.

Qwen3-VL-Embedding is designed exactly as an embedding plus reranking pair, where the embedding model handles the initial recall stage and the reranker does precise scoring. VLM2Vec is also relevant because it supports instruction-guided multimodal embeddings, which matches your “user-defined criteria” requirement better than plain task-agnostic similarity. (GitHub)

That means a user query like:

short fur, triangular ears, dark forehead stripes

can first be used to retrieve the top few candidate references, and then only those finalists go through the expensive detailed comparison stage. (Tiger AI Lab)
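A rough sketch of that two-stage flow, with the model calls left as placeholders (embed_text, load_reference, and compare_with_rubric stand in for whatever instruction-aware embedding model and VLM you pick):

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def shortlist(criteria: str, reference_embeddings: dict[str, np.ndarray], k: int = 3) -> list[str]:
    """Stage 1: coarse recall with an instruction-aware embedding model."""
    query_vec = embed_text(criteria)                     # hypothetical embedding call
    scored = sorted(
        reference_embeddings.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [ref_id for ref_id, _ in scored[:k]]

def judge(candidate_path: str, criteria: str, finalists: list[str]) -> dict[str, str]:
    """Stage 2: detailed, criterion-by-criterion judgment only on the shortlist."""
    return {
        ref_id: compare_with_rubric(candidate_path, load_reference(ref_id), criteria)  # hypothetical
        for ref_id in finalists
    }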


What I would do in each stack

1. If you must stay on llama.cpp

This is the hardest path today.

The good news is that llama-server already documents slot save and restore for prompt cache. The bad news is that current open issues show those capabilities are still blocked or incomplete for multimodal contexts and even for some text-only conversations when --mmproj is loaded. One open issue says slot save for vision-enabled models does not work; another requests slot save or restore for hybrid Qwen multimodal use; another says that loading --mmproj can block slot persistence, context shift, and prompt cache reuse because the server treats “multimodal capability exists” as if “this slot contains images.” (GitHub)

So for llama.cpp, my advice is:

  • do not count on multimodal slot persistence as your main solution today,
  • use a long-lived in-memory session only as a temporary optimization,
  • watch the /vision/embedding work closely,
  • or move the multimodal serving layer elsewhere.

The open PR is important because it adds /vision/embedding and image_embedding inputs to llama-server, explicitly to decouple image understanding from loading and running the visual projector every time. That is the right direction, but it is not merged baseline functionality yet. (GitHub)

2. If you stay in raw Hugging Face

Use a custom wrapper around the visual path.

For Qwen2.5-VL, you have a documented image-feature seam: get_image_features(pixel_values, image_grid_thw). That is where I would capture the reference image features and store them. Then I would write a model-specific path to feed those features back in later. (Hugging Face)
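A sketch of capturing those features once per reference image (assuming Qwen2.5-VL on a recent Transformers release; the exact class names and return shapes are worth verifying against the version you run):

import hashlib
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"   # any Qwen2.5-VL checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("reference.jpg").convert("RGB")
# The image processor produces both pixel_values and image_grid_thw for Qwen-family models.
vision_inputs = processor.image_processor(images=[image], return_tensors="pt")

with torch.no_grad():
    image_embeds = model.get_image_features(
        pixel_values=vision_inputs["pixel_values"].to(model.device, model.dtype),
        image_grid_thw=vision_inputs["image_grid_thw"].to(model.device),
    )

torch.save(
    {
        "image_embeds": image_embeds,
        "image_grid_thw": vision_inputs["image_grid_thw"],
        "model_id": model_id,
        "source_image_sha256": hashlib.sha256(open("reference.jpg", "rb").read()).hexdigest(),
    },
    "reference_features.pt",
)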

What I would not do is make raw inputs_embeds your public application boundary. The Qwen2-VL issue shows that inputs_embeds-based generation can break with tensor-shape mismatches and image-token alignment problems. It is still useful plumbing, but it is not a stable long-term API abstraction. (GitHub)

3. If you can move to vLLM

This is the cleanest open-source fit for your problem.

vLLM already documents:

  • content-hash caching for multimodal items,
  • stable multi_modal_uuids,
  • skipping the actual media payload on a cache hit,
  • direct image_embeds inputs,
  • Qwen-specific support for image_grid_thw,
  • an encoder cache manager for multimodal encoder outputs. (vLLM)

So if your question is “what stack today is closest to the architecture I want,” the answer is vLLM.

That said, the public issue history shows this area is still evolving. vLLM’s own RFC acknowledges repeated media re-encoding as a real problem, which is why the encoder-cache direction exists at all. (GitHub)
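A hedged sketch of the precomputed-embeds path for a Qwen2-VL-style model (the multi_modal_data layout and prompt template vary across vLLM versions and models, so treat the key names here as assumptions to check against the docs for your release):

import torch
from vllm import LLM, SamplingParams

# Assumption: image_embeds / image_grid_thw were produced offline with the matching
# processor and vision tower (see the Hugging Face sketch above) and saved via torch.save.
cached = torch.load("reference_features.pt")

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")

prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Does this image show short fur and triangular ears?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {
            "image": {
                "image_embeds": cached["image_embeds"],
                "image_grid_thw": cached["image_grid_thw"],
            }
        },
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)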

4. If you want the clearest production pattern

NVIDIA Dynamo’s multimodal docs are the most explicit statement of the architecture you want:

  • a CPU-side LRU embedding cache stores vision encoder outputs,
  • repeated images reuse cached embeddings,
  • on a cache hit, the encode worker is skipped entirely,
  • embedding cache is separate from KV cache. (NVIDIA Docs)

That is the exact systems pattern your workload wants.
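If you want the same pattern inside your own serving code, the core of it is just an LRU map keyed by image content hash; a minimal sketch (encode_image is a placeholder for your vision encoder + projector call):

import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """Tiny LRU cache for vision-encoder outputs, keyed by image content hash."""

    def __init__(self, max_items: int = 256):
        self.max_items = max_items
        self._store: OrderedDict[str, object] = OrderedDict()

    def get_or_encode(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)        # refresh LRU position
            return self._store[key]             # cache hit: encode worker skipped
        features = encode_image(image_bytes)    # hypothetical: the expensive vision path
        self._store[key] = features
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)     # evict least recently used entry
        return features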


What I would recommend for your exact use case

Option 1. Best overall design

Use:

  • a multimodal embedding model for retrieval or shortlist generation,
  • a generative VLM or reranker for final criterion-aware scoring,
  • a reference feature store that caches post-encoder visual packages.

This gives you flexibility for user-defined criteria without re-encoding fixed references. It also scales better than pairwise generative comparison against every reference. (GitHub)

Option 2. Smallest change from your current setup

If you want minimal changes:

  • keep your current VLM,
  • create a separate offline job that pre-encodes reference images,
  • store those features,
  • patch or wrap the serving layer so requests use cached reference features.

If you stay on llama.cpp, this likely means maintaining a custom branch or waiting for the /vision/embedding work to mature. If you move to vLLM, this is much closer to supported behavior already. (GitHub)

Option 3. Add a structured attribute cache

This is not enough by itself, but it is useful.

For each reference image, generate a structured sidecar once:

{
  "fur_length": "short",
  "ear_shape": "upright triangular",
  "color_pattern": "tabby with white chest",
  "facial_markings": "dark forehead stripes"
}

Then many user criteria can be answered or prefiltered cheaply from text and attributes, with the full VLM only used for ambiguous or fine-grained checks.

This is an engineering recommendation rather than something one source states directly. It follows from the documented separation between reusable visual features, retrieval models, and reranking or generation stages. (GitHub)
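A sketch of how that sidecar can be used to prefilter before touching the VLM (the matching rule here is deliberately naive substring matching; real criteria parsing would be your own logic):

reference_attributes = {
    "ref_001": {
        "fur_length": "short",
        "ear_shape": "upright triangular",
        "color_pattern": "tabby with white chest",
        "facial_markings": "dark forehead stripes",
    },
    # ... one sidecar per reference image, generated once offline
}

def prefilter(criteria: dict[str, str]) -> list[str]:
    """Keep only references whose cached attributes already match every criterion."""
    matches = []
    for ref_id, attrs in reference_attributes.items():
        if all(value.lower() in attrs.get(field, "").lower() for field, value in criteria.items()):
            matches.append(ref_id)
    return matches

# Only the survivors go to the expensive VLM comparison.
finalists = prefilter({"fur_length": "short", "ear_shape": "triangular"})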


Pitfalls to avoid

1. Do not confuse encoder cache with KV cache

They solve different problems. NVIDIA’s docs say this explicitly: embedding or encoder cache stores vision encoder outputs, while KV cache reuses attention state after prefill. Your slowdown is showing up during image encoding, so encoder cache matters more than KV cache. (NVIDIA Docs)

2. Do not assume “same image” means “same safe cached embedding”

For Qwen-family models, preprocessing settings affect the feature layout. The docs show min_pixels and max_pixels change resolution, and image_grid_thw is part of the feature contract. So your cache key should include not just the image hash, but also the model revision, processor revision, and preprocessing settings. That is an engineering inference grounded in the documented feature contract. (Hugging Face)
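Concretely, a cache key along these lines covers that contract (which preprocessing fields you include depends on your processor config):

import hashlib
import json

def reference_cache_key(
    image_bytes: bytes,
    model_id: str,
    model_revision: str,
    processor_revision: str,
    preprocessing: dict,
) -> str:
    """Key that invalidates cached features when anything affecting the feature layout changes."""
    payload = {
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "model_id": model_id,
        "model_revision": model_revision,
        "processor_revision": processor_revision,
        # e.g. {"min_pixels": 256 * 28 * 28, "max_pixels": 1280 * 28 * 28} for Qwen-family models
        "preprocessing": preprocessing,
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()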

3. Do not build your product around brittle low-level generation hooks

The inputs_embeds route can work, but the Transformers issue history shows it is a sharp edge for multimodal generation. Use it as internal plumbing, not as the main public abstraction of your application if you can avoid it. (GitHub)

4. Do not run full generative comparison against every reference

Use embedding recall first if the reference set is large. That is what Qwen3-VL-Embedding plus its reranker, and instruction-aware embedding work like VLM2Vec, are built for. (GitHub)


Bottom line

For your case, the right answer is:

  • Yes, this is a known problem.
  • Yes, people handle it by splitting fixed visual encoding from variable text-side criteria.
  • llama.cpp has partial building blocks but does not yet look like the cleanest production answer for multimodal reference reuse.
  • Hugging Face can do this with model-specific code, but the low-level path is brittle.
  • vLLM is the best open-source fit today if you want cached media inputs and direct precomputed image_embeds.
  • Dynamo is the clearest reference architecture for “skip vision encoding on cache hit.” (GitHub)

The most useful mental model is:

cache reference-image features once, retrieve first if needed, and let the final VLM judge only the few candidates that matter under the user’s current rubric. (NVIDIA Docs)

Hello, thanks for such a detailed answer!
