Docling image captioning best VLM

What is the current SOTA model for captioning images in documents?

I need good descriptions of diagrams. Most of the ones I have seen give very basic descriptions, such as “the image contains a woman in a blue dress”. I need something more like “The figure shows a flowchart representing a process of… that starts with… and ends with… key steps are…”

Or “The image depicts a scene in which people walk about in a modern cafe, key elements of the cafe’s design are…”

In other words, I need a good paragraph that offers some insight into the image.

Any suggestions on models?


I’m not sure which VLMs are strong at understanding the context of image content…
How about trying out a few VLMs that seem to perform reasonably well…
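
Whichever model gets tried, the instruction given to it may also affect how detailed the caption is. Here is a minimal sketch of a prompt builder that asks for the kind of paragraph described in the question. It assumes the captioning step lets you pass a custom free-text prompt to the VLM; `build_caption_prompt` and the `kind` values are hypothetical names made up for illustration, not part of Docling or any model's API.

```python
# Hypothetical helper: builds an instruction string for a VLM.
# Nothing here calls Docling or a model; it only assembles text.

def build_caption_prompt(kind: str = "auto") -> str:
    """Return an instruction asking for a paragraph-length description."""
    base = (
        "Describe this image from a document in one detailed paragraph. "
        "Do not stop at a single-sentence caption."
    )
    hints = {
        # For flowcharts and other diagrams: ask for the process structure.
        "diagram": (
            "It is a diagram or flowchart: state what process it represents, "
            "where it starts, where it ends, and the key steps in between."
        ),
        # For photos or illustrations: ask for activity and setting.
        "scene": (
            "It is a photo or illustration of a scene: state what is "
            "happening and the key elements of the setting and its design."
        ),
    }
    # Fallback used when the image type is not known in advance.
    fallback = (
        "If it is a diagram, explain the process it shows from start to end; "
        "otherwise describe the scene and its key elements."
    )
    return f"{base} {hints.get(kind, fallback)}"


if __name__ == "__main__":
    print(build_caption_prompt("diagram"))
```

The same prompt can then be reused across the candidate VLMs so their outputs are compared on equal terms.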