Document AI requirements and MCP tooling for tasks

Endyion · November 12, 2025, 2:53pm

Hi,

current situation:

I have thousands of different documents (PDF, MSG, EML, DOC, TXT). 80% or more are PDF files.

my goals:

1. Extract information from documents reproducibly, using a schema that is always the same for a given document type/intent/subject.
2. Decide what action the document's subject requires: Inform me? Use the extracted data to update some database entries? Store the file somewhere specific? …
3. Watch my email inbox and apply 1. and 2. to mails, which may have attachments.
4. Have a web interface like OpenWebUI to give a model tasks to accomplish. These tasks sometimes depend on additional data that the model has to pull from an API.
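
To make goals 1 and 2 concrete, here is a minimal sketch of a fixed schema per document type plus an action lookup. The document types, field names and action names are all made up for illustration; nothing here is tied to a particular model or library.

```python
# Hypothetical schemas: the fields every document of a given type must yield.
SCHEMAS = {
    "invoice": ("sender", "invoice_number", "total"),
    "contract": ("sender", "parties", "start_date"),
}

# Hypothetical mapping from document type to the follow-up action.
ACTIONS = {
    "invoice": "update_database",
    "contract": "store_file",
}


def validate(doc_type: str, extracted: dict) -> dict:
    """Return only the schema fields, in schema order; raise if one is missing."""
    fields = SCHEMAS[doc_type]
    missing = [name for name in fields if name not in extracted]
    if missing:
        raise ValueError(f"{doc_type}: missing fields {missing}")
    return {name: extracted[name] for name in fields}


def decide_action(doc_type: str) -> str:
    """Fall back to informing the user when no action is configured."""
    return ACTIONS.get(doc_type, "inform_user")
```

Whatever the model returns would pass through `validate` first, so the stored result always has the same shape for a given type, and unknown types simply end up as a notification.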

what I’ve done so far (tests/experiments/research):

I tried different models (deepseek:1.5b/gemma2:2b/llama2/mistral:8b/phi3:14b/phi4/tinyllama/zephyr) with the penalty of running them on very old hardware (4-bit quantized, CPU only), with these results:
a) It was very slow: extracting one document takes about 12 minutes.
b) The extraction result differed from prompt to prompt, and also between runs with the same LLM; this could also be due to a badly crafted prompt on my part.
c) Occasionally values were assigned wrongly, e.g. it took my information instead of the sender's.
Using annotation tools like doccano/Label Studio to create high-quality datasets that help the LLM with detection. I was not able to get the datasets ready.
Detectron2: could not get it running
Docling: not enough computing power for now
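
Regarding b), one thing that may reduce run-to-run variation is pinning the sampling settings. The sketch below only builds a request body and assumes the models are served through Ollama's `/api/generate` endpoint (an assumption; the model tags above just look like Ollama tags). A zero temperature and a fixed seed make output more repeatable but do not guarantee identical results.

```python
import json


def build_request(model: str, prompt: str, seed: int = 0) -> dict:
    """Build a request body with sampling pinned down (assumes Ollama's API)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,    # one complete response instead of chunks
        "format": "json",   # ask the server to constrain output to JSON
        "options": {"temperature": 0, "seed": seed},
    }


# Illustrative prompt; the field names are placeholders.
payload = build_request("phi4", "Extract sender and date as JSON:\n<document text>")
body = json.dumps(payload)
```

With a default local install, `body` would be POSTed to `http://localhost:11434/api/generate`.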

open questions:

How can I better help the LLM recognize which document types exist, by their layout or by certain keywords, so that it knows what the document's subject is? With that information it would know what data to extract and in what form to return it, e.g. as JSON.
Is DocArray the way to go?
Are there LLMs optimized for exactly one topic in ML? For example, are LayoutLMv3 models only for document analysis/extraction, and Command R+ for tool use?
What LLMs and techniques are recommended?
Have you done something similar and succeeded?
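
For the first question, this is roughly the kind of keyword-based type detection meant, as a minimal sketch. The types and keyword lists are placeholders, and a real setup would probably need layout features as well.

```python
import re

# Hypothetical keyword lists per document type.
KEYWORDS = {
    "invoice": ("invoice", "amount due", "vat"),
    "contract": ("agreement", "party", "termination"),
}


def classify(text: str) -> str:
    """Pick the type whose keywords occur most often; 'unknown' if none match."""
    lowered = text.lower()
    scores = {
        doc_type: sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

The detected type would then select which fields the model is asked to extract.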

Thank you for your help

John6666 · November 13, 2025, 12:41am

I don’t know much about the actual behavior when MCP is involved, so I’ll just leave the resources here for now…
