For fields and coordinates stored in a PDF as text, it is faster and more reliable to extract them with an ordinary Python program than with generative AI. Vision models that can recognize coordinates do exist, but they are limited when you need to analyze many text fields.
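As a rough sketch of the non-AI route, the snippet below reads words together with their bounding boxes and then picks the value printed to the right of a label. It assumes PyMuPDF (`pip install pymupdf`) as the PDF library, which is my choice for illustration. The helper names `load_words` and `value_right_of`, the `y_tol` tolerance, and the "label followed by value on the same line" layout are also assumptions, so adjust them to your documents.

```python
from typing import Iterable, List, Optional, Tuple

# (x0, y0, x1, y1, text): one word and its bounding box in PDF points.
Word = Tuple[float, float, float, float, str]


def load_words(pdf_path: str, page_index: int = 0) -> List[Word]:
    """Read words and their boxes from one page of a text-based PDF."""
    import fitz  # PyMuPDF, imported lazily so value_right_of has no dependency

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        # Each item is (x0, y0, x1, y1, word, block_no, line_no, word_no).
        return [tuple(w[:5]) for w in page.get_text("words")]
    finally:
        doc.close()


def value_right_of(
    words: Iterable[Word], label: str, y_tol: float = 3.0
) -> Optional[str]:
    """Return the text on the same line as `label`, to its right.

    Hypothetical helper: it assumes the field value sits on the same
    baseline as its label, within `y_tol` points vertically.
    """
    words = list(words)
    for _, label_y0, label_x1, _, text in words:
        if text != label:
            continue
        # Keep words that start at or after the label's right edge
        # and whose top edge is close to the label's top edge.
        same_line = [
            w for w in words
            if abs(w[1] - label_y0) <= y_tol and w[0] >= label_x1
        ]
        same_line.sort(key=lambda w: w[0])  # left-to-right reading order
        return " ".join(w[4] for w in same_line) or None
    return None
```

With a real file this would be called as `value_right_of(load_words("form.pdf"), "Name:")`, where `form.pdf` and the label are placeholders. This only works when the PDF actually contains a text layer; scanned pages need OCR or a VLM instead.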
I think conversion to markdown and interpretation can be handled by a general-purpose, high-performance VLM. A VLM needs its input as an image, but several Python libraries can convert PDF pages to images.
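For the PDF-to-image step, here is a minimal sketch that again assumes PyMuPDF (other libraries would work too). The function names, the `pages` output folder, the file naming scheme, and the 200 dpi default are illustrative assumptions, not requirements of any particular VLM. The `dpi` argument of `get_pixmap` needs a reasonably recent PyMuPDF release.

```python
from pathlib import Path
from typing import List


def page_image_path(pdf_path: str, page_index: int, out_dir: str = "pages") -> Path:
    """Build an output name such as pages/report_p001.png (1-based page number)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_p{page_index + 1:03d}.png"


def render_pdf_to_png(pdf_path: str, out_dir: str = "pages", dpi: int = 200) -> List[Path]:
    """Render every page of a PDF to a PNG file and return the file paths."""
    import fitz  # PyMuPDF, imported lazily; install with `pip install pymupdf`

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    try:
        paths = []
        for index, page in enumerate(doc):
            # Higher dpi keeps small text legible at the cost of larger images.
            pixmap = page.get_pixmap(dpi=dpi)
            target = page_image_path(pdf_path, index, out_dir)
            pixmap.save(str(target))  # format is inferred from the .png suffix
            paths.append(target)
        return paths
    finally:
        doc.close()
```

The resulting PNG files can then be passed to whichever VLM you choose, one page per request or batched, depending on what the model accepts.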
In addition, LLaVA and similar models can very likely process the contents of the markdown output to some extent.
VLMs that are good at handling images of documents, such as PDF pages, are introduced below.