Hey all — I wanted to share an idea I’ve been thinking about, and I’d love to hear your thoughts on whether it’s viable, whether anything similar already exists, or whether anyone just wants to try it out.
The Problem
Most vision-language models (like CLIP, BLIP, Flamingo, GPT-4V, etc.) represent an image as either:
- A single embedding vector, or
- A fixed-length sequence of embeddings (e.g., patch tokens from ViT).
This works reasonably well for simple images, but it becomes a clear limitation when dealing with visually complex images — like comics, densely annotated diagrams, infographics, or photos with layered scenes. These types of images contain more information than a single embedding or short sequence can realistically capture.
Once compressed into a limited embedding, a lot of visual detail is simply lost.
The Idea: Iterative Residual Embedding
What if instead of trying to stuff everything into one embedding or fixed-length sequence, we allowed the model to iteratively extract embeddings, each one capturing the residual information not covered by the previous one?
Here’s a conceptual sketch:
- First pass: Extract embedding E₁ from the full image.
- Decode E₁ into an image approximation I₁.
- Compute the residual R₁ = I − I₁ (i.e., the information missed in the first pass).
- Second pass: Encode R₁ → get embedding E₂.
- Repeat until the residual is negligible or reaches a threshold (e.g., 99% of the original image info captured).
The final representation becomes:
E = [E₁, E₂, …, Eₙ]
This could be passed to a language model (or other downstream tasks) as a variable-length sequence of embeddings.
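To make the loop above concrete, here's a minimal sketch in PyTorch-style Python. The encoder and decoder are placeholder linear layers standing in for whatever you'd actually use (e.g., a pretrained VAE's encoder/decoder, or a ViT paired with a lightweight reconstruction head), and the stopping rule uses the fraction of pixel-space energy explained as a rough stand-in for the "99% of the info captured" criterion. All of those choices are assumptions for illustration, not a settled design.

```python
# Minimal sketch of the iterative residual embedding loop.
# "encoder" and "decoder" are hypothetical stand-ins (tiny linear layers);
# in practice they might be the two halves of a pretrained VAE or a ViT
# with a reconstruction head.

import torch
import torch.nn as nn

IMG_DIM = 3 * 64 * 64   # flattened 64x64 RGB image (toy size)
EMB_DIM = 256           # embedding size per pass

encoder = nn.Linear(IMG_DIM, EMB_DIM)   # image (or residual) -> embedding
decoder = nn.Linear(EMB_DIM, IMG_DIM)   # embedding -> approximate image

def iterative_residual_embed(image, max_passes=8, captured_frac=0.99):
    """Return a variable-length stack of embeddings [E1, E2, ..., En].

    Stops once the reconstructions explain `captured_frac` of the image's
    pixel-space energy (a crude stand-in for "99% of the info captured"),
    or after `max_passes` iterations.
    """
    x = image.flatten()
    residual = x.clone()
    total_energy = x.pow(2).sum()
    embeddings = []

    for _ in range(max_passes):
        e = encoder(residual)          # E_k: embed what is still unexplained
        approx = decoder(e)            # I_k: decode back to image space
        residual = residual - approx   # R_k = previous residual - reconstruction
        embeddings.append(e)

        # Fraction of the original image energy still unexplained.
        remaining = residual.pow(2).sum() / total_energy
        if remaining < (1.0 - captured_frac):
            break

    return torch.stack(embeddings)     # shape: (n_passes, EMB_DIM)

# Toy usage: with a trained encoder/decoder, a complex image would typically
# need more passes than a simple, near-constant one.
img = torch.rand(3, 64, 64)
E = iterative_residual_embed(img)
print(E.shape)  # torch.Size([n, 256])
```

The key design choice is that each pass re-encodes the residual rather than the original image, so every new Eₖ only has to account for what the previous embeddings missed.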
Why This Could Matter
- Adaptive compression: Complex images get more embeddings; simple ones get fewer.
- Preserves fine detail: Instead of forcing a lossy one-shot embedding, this lets the model refine its understanding over time.
- Compatible with existing tools: You could plug this into CLIP-style or ViT-based architectures with moderate modification.
- Biologically inspired: Humans also build up visual understanding in stages, not all at once.
Open Questions
- Does this already exist under a different name?
- What’s the best way to decode an embedding into an approximate image (for residual computation)? Use a VAE? Diffusion model?
- How do we quantify the “remaining” information — perceptual loss, entropy, something else?
- How should downstream models (e.g., a language model) consume the variable-length embedding set? (One possible approach is sketched after this list.)
- Is this something worth building a prototype around?
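On the downstream-consumption question, one plausible route (similar in spirit to how LLaVA- or BLIP-2-style models feed visual tokens to a language model) is to project each Eₖ into the LM's token-embedding space and treat the results as a variable-length prefix of soft tokens, padding within a batch and tracking validity with an attention mask. The dimensions and the projection layer below are illustrative assumptions, not a fixed design.

```python
# Sketch: feeding variable-length residual embeddings to a language model
# as soft prompt tokens. EMB_DIM / LM_DIM and the linear projection are
# hypothetical placeholders.

import torch
import torch.nn as nn

EMB_DIM = 256    # dimension of each residual embedding E_k
LM_DIM = 1024    # hidden size of the hypothetical language model

project = nn.Linear(EMB_DIM, LM_DIM)   # maps E_k into the LM token space

def to_lm_inputs(embedding_sets):
    """Pad a batch of variable-length embedding sequences for an LM.

    embedding_sets: list of tensors, each of shape (n_i, EMB_DIM)
    returns: (batch, max_n, LM_DIM) soft tokens and a (batch, max_n) bool mask
    """
    projected = [project(E) for E in embedding_sets]
    max_n = max(p.shape[0] for p in projected)

    tokens = torch.zeros(len(projected), max_n, LM_DIM)
    mask = torch.zeros(len(projected), max_n, dtype=torch.bool)
    for i, p in enumerate(projected):
        tokens[i, : p.shape[0]] = p
        mask[i, : p.shape[0]] = True
    return tokens, mask

# Toy usage: a simple image yielded 2 embeddings, a complex one yielded 5.
simple_img = torch.randn(2, EMB_DIM)
complex_img = torch.randn(5, EMB_DIM)
soft_tokens, attn_mask = to_lm_inputs([simple_img, complex_img])
print(soft_tokens.shape, attn_mask.shape)  # (2, 5, 1024) (2, 5)
```

Because the number of embeddings already scales with image complexity, the mask is doing the same job padding does for variable-length text, so this wouldn't require any change to a standard transformer LM.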
That’s all I have for now, so feel free to comment or expand on this!