Docling image captioning best VLM

swtb · April 25, 2025, 2:37pm

What is the current SOTA model for captioning images in documents?

I need good descriptions of diagrams. Most of the models I have seen produce very basic descriptions, such as “the image contains a woman in a blue dress”. I need something more like “The figure shows a flowchart representing a process of… that starts with…and ends with…key steps are…”

Or “The image depicts a scene in which people walk about in a modern cafe, key elements of the cafe's design are…”

In other words, I need a good paragraph that offers some insight into the image.

Any suggestions on models?

John6666 · April 25, 2025, 3:33pm

I’m not sure which VLM is strongest at understanding the context of image content…
How about trying out a few VLMs that seem to perform reasonably well and comparing their output on your diagrams…
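
Whichever model you end up trying, the instruction you give it may matter as well: asking explicitly for the structure you described (type of figure, start, end, key steps) tends to be worth testing before switching models. A minimal sketch of such a prompt, assuming a hypothetical helper `build_caption_prompt` (my own name, not part of Docling or any library); how the string is passed to a model depends on the VLM and pipeline you use:

```python
def build_caption_prompt(kind: str = "figure") -> str:
    """Build an instruction that asks for a paragraph-length description.

    Hypothetical helper for illustration only; the wording of the points
    is an assumption based on the examples in the question.
    """
    points = [
        f"what type of {kind} this is (flowchart, chart, photo, ...)",
        "where it starts and where it ends, if it shows a process",
        "the key steps or elements and how they relate to each other",
    ]
    lines = [f"Describe this {kind} in one detailed paragraph. Cover:"]
    lines += [f"- {point}" for point in points]
    lines.append("Do not stop at a one-line summary.")
    return "\n".join(lines)


# Print the prompt so it can be pasted into whatever VLM is being tested.
print(build_caption_prompt("diagram"))
```

Running the same prompt against each candidate model on a handful of your own diagrams should make the comparison more meaningful than relying on default captions.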

