Any model that takes in a clean PDF and outputs a JSON of all the fillable fields that should be added to it + coordinates?

So basically I’m searching for a model that can take a clean PDF and output a JSON of all the fillable fields, with relevant names, types, page number and, most importantly, coordinates in the PDF file. By clean I just mean a PDF form without the fields embedded into it to allow filling. I already know how to add the fields to the PDF itself and save it, but I need something that will tell me which fields to add and where to add them.
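Just to make the expected output concrete, here is roughly the kind of JSON I’d want back (the field names, pages and numbers below are made up purely for illustration):

import json

# Illustrative output only: one entry per field that should be added to the PDF
fields = [
    {"name": "patient_name", "type": "text", "page": 1,
     "rect": {"x": 72.0, "y": 640.0, "w": 220.0, "h": 18.0}},
    {"name": "is_smoker", "type": "checkbox", "page": 2,
     "rect": {"x": 90.0, "y": 310.0, "w": 12.0, "h": 12.0}},
]
print(json.dumps(fields, indent=2))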

Is there any model like this on Hugging Face? If you want examples of this, while there aren’t a lot at all online, see https://www.youtube.com/watch?v=-SlgDIjVW1g


For fields and coordinates that are stored in the PDF as text, it is faster and more reliable to extract them with a normal Python program than with generative AI. There are vision models that can recognize coordinates, but the models available for analyzing multiple text fields are limited.

I think the conversion to markdown and its interpretation can be done with a general high-performance VLM. For a VLM to handle the document, the input needs to be an image, but there are several Python libraries for converting PDF pages to images (see the sketch below).
In addition, there is a good chance that LLaVA and similar models can process the contents of the markdown output to some extent.
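For example, a minimal sketch with PyMuPDF (pdf2image would also work; the file name is just a placeholder):

import fitz  # PyMuPDF

doc = fitz.open("form.pdf")
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=200)  # render the page as a raster image
    pix.save(f"page_{i + 1}.png")   # one image per page, ready for a VLM or OCR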

Some VLMs that are good at handling images of documents such as PDFs are introduced below.

Thanks for the reply. I should’ve mentioned that I tried microsoft/layoutlmv3-base, but honestly it gave random coordinates, so that experiment was a failure.

I think there might have been a misunderstanding, though. The reason I need an LLM or AI for this is that the PDF is just a PDF; it has no fillable forms or anything like that. What I need from the LLM is to recognize the places where fillable fields should be added, give them a name, determine their type (i.e. text, checkbox, etc.) and the coordinates on the page. And it needs to work for every document (they should all be specially made, not random or scanned). A Python library wouldn’t help in this case (at least I believe it wouldn’t…)

Just as an example, take this document. It doesn’t have any fillable fields; it’s clean. You would need the LLM to output all the places where a field should be added (for example “Patient’s Name and Address”, type text, coordinates: x, y, w, h).


That’s certainly not something that can be done with a non-AI library alone…

However, for images of text set in printed (non-handwritten) fonts, deep-learning-based libraries are useful to a certain extent.

Anyway, at the moment I don’t think this is easy to do with just one open-source AI model…
I think you first need to break the process into steps and plan how to assign each step to the appropriate programs and models.
Of course, it might be possible to use a VLM or multimodal LLM alone, either by fine-tuning the model for this purpose or by using an extremely large model regardless of cost…

However, if this is beyond what the LayoutLM series can handle, I think a combined approach is more realistic. Even among existing AI-based services, the ones that use AI for the core part and pre- and post-process with ordinary programs stand out more than the ones based solely on AI. This is especially advantageous when accuracy is required.

There is one more difficulty with the coordinates: no matter which VLM you use, the image passed to the model will be resized beforehand to match the model’s expected input size.

So the resolution of the image the model is actually looking at will differ from the resolution of the image the user thought they passed in, and there will be a discrepancy. You will probably need to devise a way to have the model return positions as percentages of the page width and height.
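As a rough sketch of what I mean, assuming the model returns (x, y, w, h) as percentages of whatever image it saw:

def to_page_coords(box_pct, page_width, page_height):
    # box_pct: (x, y, w, h) as percentages (0-100) of the image the model saw
    x, y, w, h = box_pct
    return (x / 100 * page_width, y / 100 * page_height,
            w / 100 * page_width, h / 100 * page_height)

# e.g. mapping onto a US Letter page in PDF points (612 x 792)
print(to_page_coords((11.8, 19.0, 36.0, 2.3), 612, 792))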

Thanks again for the reply. Do you think the example I provided uses multiple programs and models? Just from their demo, it looks fantastic. I’m confused as to what your suggestion is for what I should plan.

Also, regarding the resolution: I believe I saw some models that return an approximation of the x and y coordinates relative to the origin as percentages, so you can later take the original width and height and apply the coordinates to them. See


Do you think if I convert each PDF page to an image, it’ll be easier to detect the coordinates? Or will it be the same?


Do you think the example I provided uses multiple programs and models?

Yes. Or, if you don’t need to support handwriting, it seems like something like this could work…

Do you think if I convert each PDF page to an image, it’ll be easier to detect the coordinates?

Yes. If you put the whole PDF into one image, the vertical coordinates will be completely out of order… or at least the accuracy will drop significantly.

If you don’t need coordinates, I think PyMuPDF is the easiest option, but in this case we do need coordinates, so it’s a bit more involved.

So I don’t need to support handwriting as the PDFs I’m trying to process are all blank (meaning they are empty forms).

And when I said PDF->Image I was talking about converting each page to its own image and processing each image (or page) individually. I think this will generally help if you keep the dimensions exactly the same and use PDF coordinates that are relative to the origin (i.e. in percentages, not pixels or anything else).

However, since I must have coordinates in order to insert fillable fields in these places, libraries such as Python’s PyMuPDF or Node’s pdf-lib won’t help in this case.

I was trying to figure out if there’s a way to involve OCR in this process, but since all I can find with it is the text, I can’t know for certain where the field should be added (below the label? to its right? maybe there are multiple fields for one label?).


For example, can’t you get the coordinates this way?

import pytesseract
from PIL import Image
# Get bounding box estimates (character-level boxes)
print(pytesseract.image_to_boxes(Image.open('test.png')))

With this you’d be able to get the bounding boxes of the text of the fields, but only if you knew in advance which strings to search for. In the link you provided, I basically want to automate his step 1. You can see he goes into Paint (or some other program) and manually checks the x, y, w, h of each field. This is exactly the process I wish to automate, so I can automatically add fillable fields in these places without manual intervention. The manual part is a major pain point when you get a bulk of documents and need to go over each one and assign fields by hand. Some forms can have hundreds upon hundreds of fields, so I wish to automate this process.


In such cases, I think the easiest way is a Python program that iterates with a for loop. The reason I can’t unreservedly recommend a VLM here is that the more complex the output is, the more likely it is to be incomplete. I think this is also true for ChatGPT and Gemini.

pytesseract is fast, so you can afford to run it many times; if you only process one item at a time, it is unlikely to make a mistake. Of course, if you have abundant GPU resources you can do the same thing with a high-performance VLM, but you probably won’t need one for anything other than handwriting…

Of course, if there is a program or model that can perform the same task, there is no need to stick with pytesseract; please take the sketch below only as one example.
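As a very rough sketch of that kind of loop (the label strings, the fixed field width and the “place it to the right of the label” heuristic are all assumptions you would have to tune per form):

import json
import pytesseract
from PIL import Image

LABELS = {"Name:": "text", "Address:": "text", "Smoker:": "checkbox"}  # assumed label strings

img = Image.open("page_1.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

fields = []
for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]):
    word = text.strip()
    if word in LABELS:
        # Naive heuristic: put the field just to the right of the label,
        # same height as the label, fixed width in pixels.
        fields.append({
            "name": word.rstrip(":").lower(),
            "type": LABELS[word],
            "x": left + width + 5, "y": top, "w": 200, "h": height,
        })

print(json.dumps(fields, indent=2))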