How to train a model to extract specific data from PDFs?

Fireche · September 24, 2023, 4:05pm

I want to fine-tune a model to recognize specific data from PDFs. What steps do I need to take to make this work?

The PDFs are structured so that certain data such as “address”, “type”, “project name”, etc. is mostly in the same spot, but it doesn’t have to be. Ideally, the model would detect these fields automatically.

Any advice and input is appreciated. Thanks

HassanMahmood · January 30, 2025, 7:14am

Did you find a solution for this?

John6666 · January 30, 2025, 8:50am

If you want an LLM to process something other than plain text, I think many people would use RAG to achieve this.
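
To make the RAG idea a bit more concrete, here is a rough sketch of just the retrieval step, assuming the PDF text has already been extracted and split into chunks by some other tool. Everything in it is illustrative: the sample chunks are made up, `retrieve` and `build_prompt` are hypothetical helper names, and simple bag-of-words cosine similarity stands in for the embedding model a real setup would probably use. The resulting prompt would then be sent to whichever LLM you choose.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(chunks, query, top_k=1):
    """Return the top_k chunks most similar to the query (hypothetical helper)."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine_similarity(Counter(tokenize(c)), query_vec), c) for c in chunks]
    # Stable sort: chunks with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def build_prompt(chunks, field):
    """Assemble an extraction prompt from the retrieved context (hypothetical helper)."""
    context = "\n".join(retrieve(chunks, field, top_k=2))
    return f"Context:\n{context}\n\nExtract the {field} from the context above."


# Made-up chunks, standing in for text already extracted from a PDF.
chunks = [
    "Project name: Riverside Library Renovation",
    "Address: 12 Example Street, Springfield",
    "Type: Public building",
]

print(build_prompt(chunks, "address"))
```

Since the fields in the question are "mostly in the same spot but don't have to be", retrieving by content rather than by page position is the part of this approach that seems relevant. Whether it is accurate enough without fine-tuning would need to be tested on the actual documents.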
