I need your opinion about Metadata Extraction

princi97 · March 27, 2024, 11:01am

Hello everyone,

I’m writing to ask for your opinion on the approach I’m using to extract metadata from PDF documents. My idea is to use one of the many Python libraries to extract the text from the PDF (falling back to OCR when the file has no text layer) and pass that text as the “context” to a large language model (LLM), which then answers a fixed set of queries (for example, the total amount of an invoice). Do you think this is a valid approach? Can you suggest better ones? I’d like to minimize the annotation effort, which is why I prefer this approach over LayoutLMv3 or Donut for feature extraction.
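For concreteness, here is a minimal sketch of what such a pipeline could look like. It is only an illustration under several assumptions: `pypdf` is used for the text layer, `pdf2image` plus `pytesseract` for the OCR fallback, and `ask_llm` is a hypothetical placeholder for whatever function sends a prompt to the chosen LLM and returns its reply as a string. The field names in `QUERIES` and the `min_chars` threshold are also made up for the example.

```python
import json
from typing import Callable

# Fixed ("static") questions, one per metadata field. Field names are illustrative.
QUERIES = {
    "total_amount": "What is the total amount of the invoice?",
    "invoice_date": "What is the invoice date?",
}


def extract_text(pdf_path: str, min_chars: int = 20) -> str:
    """Return the PDF's text layer, falling back to OCR if it is (nearly) empty."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if len(text.strip()) >= min_chars:
        return text

    # Assumed OCR fallback: pdf2image and pytesseract (both need system binaries).
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)


def build_prompt(context: str, queries: dict[str, str]) -> str:
    """Combine the fixed questions and the document text into a single prompt."""
    questions = "\n".join(f"- {field}: {question}" for field, question in queries.items())
    return (
        "Answer using only the document below. Reply with a JSON object keyed "
        "by field name; use null when a value is not present.\n\n"
        f"Fields:\n{questions}\n\nDocument:\n{context}"
    )


def extract_metadata(
    context: str,
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, raw model reply out
    queries: dict[str, str] = QUERIES,
) -> dict:
    """Ask all questions in one call and keep only the requested fields."""
    raw = ask_llm(build_prompt(context, queries))
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        answer = {}  # the model did not return valid JSON
    if not isinstance(answer, dict):
        answer = {}
    return {field: answer.get(field) for field in queries}
```

Usage would be along the lines of `extract_metadata(extract_text("invoice.pdf"), ask_llm)`. Passing the LLM call in as a function keeps the sketch independent of any particular model or provider, and asking for JSON makes a malformed reply easy to detect instead of silently accepting it.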

Thank you!
