Which Model Should I Choose: TrOCR, TrOCR + LayoutLM, or Donut?

I am developing a web application to process a collection of scanned, domain-specific documents. The collection has five different document types, plus one type of handwritten form. The form contains a mix of printed and handwritten text, while the other documents are entirely printed, and all of them contain the name of the person.

Key Requirements:

  1. Search Functionality – Users should be able to search for a person’s name and retrieve all associated scanned documents.
  2. Key-Value Pair Extraction – Extract structured information (e.g., First Name: John), where the value (“John”) is handwritten.

Model Choices:

  • TrOCR (plain) – Best suited for pure OCR tasks, but lacks layout and structural understanding.
  • TrOCR + LayoutLM – Combines OCR with layout-aware structured extraction, potentially improving key-value extraction.
  • Donut – A fully end-to-end document understanding model that might simplify the pipeline.

Would Donut alone be sufficient, or would combining TrOCR with LayoutLM yield better results for structured data extraction from scanned documents?

I am also open to other suggestions if there are better approaches for handling both printed and handwritten text in scanned documents while enabling search and key-value extraction.

It seems that Donut is appropriate.


For your web application, the optimal choice is Donut. Here’s a structured summary of the reasoning and conclusion:

Analysis and Conclusion:

  1. Use Case Requirements:

    • Search Functionality: Accurate extraction of names (handwritten and printed) is crucial.
    • Key-Value Pair Extraction: Ability to structure data, especially where values may be handwritten.
  2. Model Evaluation:

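    • TrOCR Alone: Effective for pure OCR, including handwriting (handwritten fine-tuned checkpoints such as microsoft/trocr-base-handwritten exist). However, it reads one cropped text line at a time, so it needs a separate text-detection step and has no layout awareness.
    • TrOCR + LayoutLM: Adds layout awareness for structured data extraction. However, LayoutLM depends on the words and bounding boxes supplied by the OCR stage, so handwriting recognition errors carry through to the extracted values.
    • Donut: An OCR-free, end-to-end model that maps the document image directly to structured output. It can handle both printed and handwritten text, though it typically needs fine-tuning on samples of your own documents.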
  3. Advantages of Donut:

    • Simplicity and Efficiency: Simplifies the pipeline by handling OCR and layout in one model, possibly improving performance and reducing complexity.
    • Community and Documentation: Donut is available in the Hugging Face Transformers library, with documentation and an active community, which helps with troubleshooting and updates.
    • Customization: Supports fine-tuning for domain-specific documents, enhancing accuracy.
  4. Considerations:

    • Testing: Experiment with Donut and the TrOCR + LayoutLM pipeline on sample documents to compare their performance on handwritten text and structured data extraction.
    • Resource Usage: Assess the impact of Donut’s resource requirements on application scalability.
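One way to make the testing step concrete is to score each candidate's output against a small hand-labelled sample. The sketch below is only an illustration: the function names (levenshtein, character_error_rate, field_accuracy) and the field keys are hypothetical, and it assumes each model's output has already been converted to a plain dict of field name to string.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance divided by reference length (lower is better)."""
    if not reference:
        return 0.0 if not predicted else 1.0
    return levenshtein(predicted, reference) / len(reference)


def field_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields matched, ignoring case and extra whitespace."""
    if not reference:
        return 1.0

    def norm(value) -> str:
        return " ".join(str(value).split()).casefold()

    hits = sum(
        1 for key, value in reference.items()
        if key in predicted and norm(predicted[key]) == norm(value)
    )
    return hits / len(reference)


if __name__ == "__main__":
    # Hypothetical labelled sample and model output for one form.
    truth = {"first_name": "John", "last_name": "Smith"}
    output = {"first_name": "john ", "last_name": "Smyth"}
    print(character_error_rate(output["last_name"], truth["last_name"]))
    print(field_accuracy(output, truth))
```

Character error rate is the more useful number for the handwritten values, while field accuracy reflects what the search and key-value features would actually see.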

Final Decision:

Choose Donut, as it provides a streamlined solution for both OCR and layout understanding. Once fine-tuned on your documents, it can handle mixed text types while simplifying your application's setup.
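For reference, a minimal sketch of Donut inference with the Hugging Face Transformers library might look like the following. Treat it as an assumption-laden starting point: the checkpoint naver-clova-ix/donut-base-finetuned-cord-v2 and its task prompt <s_cord-v2> come from a receipt-parsing model, not from your forms, so you would substitute your own fine-tuned checkpoint and prompt. The helper names clean_donut_sequence and extract_fields are hypothetical.

```python
import re


def clean_donut_sequence(sequence: str, eos_token: str, pad_token: str) -> str:
    """Strip EOS/PAD tokens and the leading task-prompt tag from decoded text."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    # Remove only the first tag, which is the task prompt.
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


def extract_fields(
    image_path: str,
    checkpoint: str = "naver-clova-ix/donut-base-finetuned-cord-v2",  # assumption
    task_prompt: str = "<s_cord-v2>",  # assumption: prompt for that checkpoint
) -> dict:
    """Run Donut on one scanned page and return its structured output."""
    # Imported lazily so the helper above works without these packages.
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        task_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    sequence = processor.batch_decode(outputs)[0]
    sequence = clean_donut_sequence(
        sequence, processor.tokenizer.eos_token, processor.tokenizer.pad_token
    )
    # Convert the tag-style sequence into a nested dict of fields.
    return processor.token2json(sequence)
```

The returned dict is what you would store per document. Indexing its name fields in your database would then cover the search requirement.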