How to train a model to extract specific data from PDFs?

I want to fine-tune a model to recognize specific data from PDFs. What steps do I need to take to make this work?

The PDFs are structured so that certain data like “address”, “type”, “project name”, etc. is mostly in the same spot, but it doesn’t have to be. Ideally, the model would detect these fields automatically.

Any advice and input is appreciated. Thanks

Did you find a solution for this?

If you want an LLM to process something other than plain text, such as PDFs, I think many people would use RAG (retrieval-augmented generation) for this rather than fine-tuning.
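To make the RAG idea a bit more concrete, here is a minimal sketch of just the retrieval step, not a definitive implementation. It assumes the PDF text has already been extracted into a plain string (the sample text below is made up), and it uses simple word overlap as a stand-in for the embedding search a real setup would normally use. All function names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical, and the final call to an LLM is left out.

```python
import re
from collections import Counter

# Made-up sample standing in for text already extracted from a PDF.
SAMPLE_TEXT = """Project name: Riverside Lofts
Type: residential
Address: 12 Example Street, Sampletown
Notes: the contractor will confirm the schedule next week"""


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(text):
    """Treat each non-empty line as one chunk (an assumption for this sketch)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def score(query, chunk):
    """Count how many query tokens also appear in the chunk."""
    query_counts = Counter(tokenize(query))
    chunk_counts = Counter(tokenize(chunk))
    return sum(min(query_counts[t], chunk_counts[t]) for t in query_counts)


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks with the highest overlap score."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_k]


def build_prompt(field, context_chunks):
    """Assemble the prompt that would be sent to an LLM (the call is omitted)."""
    context = "\n".join(context_chunks)
    return f"Extract the {field} from the following text:\n\n{context}"


chunks = split_into_chunks(SAMPLE_TEXT)
best = retrieve("address", chunks)
prompt = build_prompt("address", best)
print(prompt)
```

The point of the retrieval step is that the model only sees the part of the document most likely to contain the field, so it matters less where on the page “address” or “project name” actually appears.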