I am working on a PHI redaction workflow that must operate across a large corpus of documents with highly varied PDF structures. These include:
Scanned PDFs (single and multi-page)
Selectable text PDFs
Hybrid PDFs (a mix of scanned and selectable-text pages)
Multi-column documents
Reports with complex tabular and form-like layouts
The goal is to develop a generalised system that can consistently detect and redact PHI elements such as names, MRNs, dates, addresses, phone numbers, and other identifiers across all of these formats.
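For context, the selectable-text subset is the easy case: once the PHI spans are known, true redaction (removing the underlying text rather than just drawing a box over it) can be done with PyMuPDF. A simplified sketch of that step, assuming the PHI strings for a document have already been detected upstream (the file names and term list here are placeholders):

```python
# Minimal true-redaction sketch with PyMuPDF (fitz), assuming the PHI
# strings are already known from a detection pass. apply_redactions()
# removes the underlying text, not just the visible pixels.
import fitz  # PyMuPDF

def redact_terms(pdf_path: str, out_path: str, terms: list[str]) -> None:
    doc = fitz.open(pdf_path)
    for page in doc:
        for term in terms:
            # search_for returns the bounding rectangles of each match
            for rect in page.search_for(term):
                page.add_redact_annot(rect, fill=(0, 0, 0))
        page.apply_redactions()  # permanently strips text under the annots
    doc.save(out_path)

redact_terms("report.pdf", "report_redacted.pdf", ["John Doe", "MRN 1234567"])
```

This only works where the text layer is real and selectable; for scanned pages the redaction boxes would have to come from OCR word coordinates instead, which is where my pipeline gets fragile.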
I have evaluated OCR tools such as Tesseract and PaddleOCR, as well as NLP-based NER models, for entity extraction. However, I am still facing challenges with layout variability, inconsistent OCR accuracy, multi-column alignment, and PHI embedded in headers, footers, tables, and annotations.
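For reference, the scanned-document path I have been testing looks roughly like the sketch below. The model choice, label set, and regex patterns are just my current placeholders, not anything I am confident in:

```python
# Rough sketch of my scanned-PDF path: rasterise pages, OCR with
# Tesseract, then run generic spaCy NER plus a few regexes for
# structured identifiers that NER tends to miss.
import re
import spacy
import pytesseract
from pdf2image import convert_from_path

nlp = spacy.load("en_core_web_sm")  # generic NER; not PHI-specific

STRUCTURED_PATTERNS = {
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
}

def extract_phi(pdf_path: str) -> list[tuple[int, str, str]]:
    findings = []
    for page_no, image in enumerate(convert_from_path(pdf_path, dpi=300)):
        text = pytesseract.image_to_string(image)
        # NER layer: names, locations, dates from the generic model
        for ent in nlp(text).ents:
            if ent.label_ in {"PERSON", "GPE", "DATE"}:
                findings.append((page_no, ent.label_, ent.text))
        # Regex layer: structured identifiers
        for label, pattern in STRUCTURED_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((page_no, label, match.group()))
    return findings
```

One obvious weakness of this per-page, plain-text approach is that reading order breaks on multi-column pages, which is exactly where the NER layer degrades the most.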
I am seeking advice on the following aspects:
Appropriate model combinations for robust PHI extraction across variable layouts.
Whether layout-aware architectures such as LayoutLM, Donut, or DiT are recommended for this type of redaction workflow.
Techniques for handling mixed structure documents where OCR accuracy varies across pages.
Strategies to normalise OCR output before applying NER (my current attempt is sketched after this list).
Suggested open-source pipelines for redaction that balance accuracy and reproducibility.
Recommended fine-tuning approaches for improving PHI detection across diverse PDF structures.
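Regarding the normalisation point above, this is the kind of cleanup I currently run between OCR and NER. It helps, but I suspect it is too crude, and the digit-confusion substitutions in particular are ad hoc guesses on my part:

```python
# Cleanup pass between OCR and NER: de-hyphenate words broken across
# line ends, rejoin hard-wrapped lines, fix a couple of digit/letter
# confusions inside numeric tokens, and collapse whitespace.
import re

def normalise_ocr(text: str) -> str:
    # Rejoin words hyphenated across line breaks: "identi-\nfier" -> "identifier"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace remaining newlines with spaces so sentences span OCR line wraps
    text = re.sub(r"\n+", " ", text)
    # Fix common OCR confusions, but only between digits (e.g. "98O7654" -> "9807654")
    text = re.sub(r"(?<=\d)[OolI](?=\d)",
                  lambda m: {"O": "0", "o": "0", "l": "1", "I": "1"}[m.group()],
                  text)
    # Collapse repeated whitespace
    return re.sub(r"\s{2,}", " ", text).strip()

print(normalise_ocr("Patient John Do-\ne, MRN 98O7654, seen on 03/05/2021"))
```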
I would greatly appreciate any architectural suggestions, model recommendations, or workflow strategies from the community.