Is there any model that takes in a clean PDF and outputs JSON listing all the fillable fields that should be added to it, plus their coordinates?

That’s certainly not something that can be done with a non-AI library alone…

That said, for images of printed (non-handwritten) text, deep-learning-based libraries (OCR engines, for example) are useful to a certain extent.

Anyway, at the moment, I don’t think it’s easy to do with a single open-source AI model…
I think you first need to split the process into stages and plan which program or model handles each one.
Of course, it might be possible with a VLM or multimodal LLM alone, either by fine-tuning it for this purpose or by using an extremely large model regardless of cost…

However, if the task is beyond what the LayoutLM family of models can handle, I think a combined approach is more realistic. Even among existing AI-based services, the ones that use AI for the core step and do the pre- and post-processing with conventional code stand out more than those built on AI alone. This is especially advantageous when accuracy is required.
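
To make the "AI in the core, normal code around it" idea a bit more concrete, here is a minimal sketch of just the post-processing half. It assumes some earlier stage (OCR, a layout model, whatever you end up using) has already produced label boxes and blank-region boxes. `Box`, `Label`, `propose_fields`, the `max_gap` threshold, and the output JSON shape are all made-up names for illustration, not any library's API, and the coordinates are assumed to be in page units with the origin at the top-left.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical bounding box: (x0, y0) top-left, (x1, y1) bottom-right.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Label:
    # A piece of printed text found by an upstream OCR/layout stage (assumed).
    text: str
    box: Box


def propose_fields(labels, blanks, page=1, max_gap=20.0):
    """Pair each blank region with the closest label to its left on the same line.

    Blanks with no matching label get a generic name like "field_3".
    """
    fields = []
    for blank in blanks:
        best = None  # (gap, label) of the closest candidate so far
        for label in labels:
            # "Same line" here just means the two boxes overlap vertically.
            overlaps = label.box.y0 < blank.y1 and blank.y0 < label.box.y1
            # Horizontal distance from the label's right edge to the blank.
            gap = blank.x0 - label.box.x1
            if overlaps and 0 <= gap <= max_gap:
                if best is None or gap < best[0]:
                    best = (gap, label)
        if best is not None:
            name = best[1].text.rstrip(":").strip().lower().replace(" ", "_")
        else:
            name = f"field_{len(fields) + 1}"
        fields.append({
            "name": name,
            "type": "text",  # assumption: every blank becomes a text field
            "page": page,
            "rect": [blank.x0, blank.y0, blank.x1, blank.y1],
        })
    return fields


# Toy input standing in for the output of the detection stage.
labels = [
    Label("Name:", Box(50, 100, 90, 112)),
    Label("Date:", Box(50, 140, 85, 152)),
]
blanks = [
    Box(95, 98, 300, 114),   # right of "Name:"
    Box(90, 138, 200, 154),  # right of "Date:"
    Box(50, 300, 300, 316),  # no label nearby
]

print(json.dumps(propose_fields(labels, blanks), indent=2))
```

The point is only that this part is plain geometry, so it is deterministic and easy to test, which is where the accuracy benefit of a combined approach comes from. The hard part (finding the labels and blanks in the first place) is still left to whichever model you pick.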