Document AI requirements and MCP tooling for tasks

Hi,

current situation:

I have thousands of different documents (PDF, MSG, EML, DOC, TXT). 80% or more are PDF files.

my goals:

  1. Extract information from documents in a reproducible manner, using a fixed schema that is always the same for a given document type/intent/subject
  2. Decide what action is required by the document subject: Inform me? Use the extracted data to update some database entries? Store the file somewhere specific? …
  3. Watch my email inbox and do the same as 1. + 2. with mails that may have attachments
  4. Have a web interface like OpenWebUI to give a model tasks to accomplish
    the tasks sometimes depend on additional data that the model has to pull from an API
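
Goals 1 and 2 could be sketched roughly like this: one fixed schema per document type, a check that the model's JSON matches it exactly, and a lookup table that maps the type to an action. The document types, field names, and action names below (`Invoice`, `Reminder`, `update_database`, ...) are made up for illustration and are not taken from any real setup.

```python
import json
from dataclasses import dataclass, fields


# Hypothetical schemas: one fixed set of fields per document type.
@dataclass
class Invoice:
    sender: str
    invoice_number: str
    total: str


@dataclass
class Reminder:
    sender: str
    due_date: str


SCHEMAS = {"invoice": Invoice, "reminder": Reminder}

# Hypothetical action names; anything unknown goes to manual review.
ACTIONS = {"invoice": "update_database", "reminder": "notify_user"}


def parse_extraction(doc_type: str, raw_json: str):
    """Turn the model's JSON answer into a typed record, or fail loudly."""
    schema = SCHEMAS[doc_type]
    data = json.loads(raw_json)
    expected = {f.name for f in fields(schema)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"expected exactly the fields {sorted(expected)}")
    return schema(**data)


def decide_action(doc_type: str) -> str:
    """Map a document type to the follow-up action."""
    return ACTIONS.get(doc_type, "store_for_review")
```

Rejecting any answer with missing or extra keys keeps the output shape the same for every document of a type, and the action decision stays in plain code instead of being left to the model.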

what I’ve done so far (tests/experiments/research):

  1. I tried different models (deepseek:1.5b/gemma2:2b/llama2/mistral:8b/phi3:14b/phi4/tinyllama/zephyr) with the handicap of running them on very old hardware (4-bit quantized, CPU only), resulting in:
    a) it took very long: extracting a single document takes about 12 min
    b) the extraction result differed from prompt to prompt and even between runs with the same LLM, which could also be due to a badly crafted prompt on my part
    c) in rare cases, wrong assignments, e.g. it took my information instead of the sender’s
  2. Using annotation tools like doccano/Label Studio to create high-quality datasets that help the LLM with detection. I was not able to get the datasets ready.
  3. Detectron2: could not get it running
  4. Docling: not enough computing power right now
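
Regarding 1b, part of the variation between runs may come from sampling rather than from the prompt. A minimal sketch, assuming the models are served by a local Ollama instance on its default port (the model tags above look like Ollama tags, but that is my assumption): set the temperature to 0, fix the seed, and request JSON output. `build_payload` and `extract` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, seed: int = 42) -> dict:
    """Request body with sampling pinned down for repeatable output."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,      # one complete answer instead of chunks
        "format": "json",     # ask the server to emit valid JSON only
        "options": {"temperature": 0, "seed": seed},
    }


def extract(model: str, prompt: str) -> dict:
    """Send the prompt and parse the model's JSON answer."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout because CPU-only inference can take many minutes.
    with urllib.request.urlopen(request, timeout=1800) as response:
        answer = json.loads(response.read())
    return json.loads(answer["response"])
```

This does not make a small model more accurate, but it removes one source of randomness, so remaining differences are easier to attribute to the prompt or to the model itself.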

open questions:

  • How can I better help the LLM learn which document types exist, based on their layout or certain keywords, so that it knows what the document subject is? With that information it would know what data to extract and in what form to return it, e.g. as JSON.
  • Is DocArray the way to go?
  • Are there LLMs optimized for exactly one topic in ML? For example, are LayoutLMv3 models only for document analysis/extraction and Command R+ for tooling?
  • What LLMs and techniques are recommended?
  • Have you done something similar and succeeded?
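
To make the first question more concrete, this is the kind of keyword-based pre-classification I have in mind: a cheap rule step picks the document type, and the type decides which fields the LLM is asked for. The keyword lists, type names, and field names are placeholders, and I have not tested this on real documents.

```python
import re

# Placeholder keyword lists per document type.
KEYWORDS = {
    "invoice": ["invoice", "amount due", "vat"],
    "contract": ["agreement", "party", "termination"],
}

# Placeholder fields the LLM should return for each type.
FIELDS = {
    "invoice": ["sender", "invoice_number", "total"],
    "contract": ["parties", "start_date", "end_date"],
}


def classify(text: str) -> str:
    """Pick the type whose keywords occur most often as whole words."""
    lowered = text.lower()
    scores = {
        doc_type: sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
            for word in words
        )
        for doc_type, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def build_prompt(text: str):
    """Build a type-specific extraction prompt, or None for unknown types."""
    doc_type = classify(text)
    if doc_type not in FIELDS:
        return None
    keys = ", ".join(FIELDS[doc_type])
    return (
        f"Document type: {doc_type}. "
        f"Return a JSON object with exactly these keys: {keys}.\n\n{text}"
    )
```

The idea is that the LLM only ever sees a narrow, type-specific question, and documents that match no keywords are set aside instead of being guessed at.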

Thank you for your help


I don’t know much about the actual behavior when MCP is involved, so I’ll just leave the resources here for now