I am working on a PHI redaction workflow that must operate across a large corpus of documents with highly varied PDF structures. These include:
Scanned PDFs (single and multi-page)
Selectable text PDFs
Hybrid PDFs (a mix of scanned and selectable-text pages)
Multi-column documents
Reports with complex tabular and form-like layouts
The goal is to develop a generalised system that can consistently detect and redact PHI elements such as names, MRNs, dates, addresses, phone numbers, and other identifiers across all of these formats.
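For context, the selectable-text subset is the easy case: once the PHI spans are known, true redaction (removing the underlying text rather than just drawing a box over it) can be done with PyMuPDF. A simplified sketch of that step, assuming the PHI strings for a document have already been detected upstream (the file names and term list here are placeholders):

```python
# Minimal true-redaction sketch with PyMuPDF (fitz), assuming the PHI
# strings are already known from a detection pass. apply_redactions()
# removes the underlying text, not just the visible pixels.
import fitz  # PyMuPDF

def redact_terms(pdf_path: str, out_path: str, terms: list[str]) -> None:
    doc = fitz.open(pdf_path)
    for page in doc:
        for term in terms:
            # search_for returns the bounding rectangles of each match
            for rect in page.search_for(term):
                page.add_redact_annot(rect, fill=(0, 0, 0))
        page.apply_redactions()  # permanently strips text under the annots
    doc.save(out_path)

redact_terms("report.pdf", "report_redacted.pdf", ["John Doe", "MRN 1234567"])
```

This only works where the text layer is real and selectable; for scanned pages the redaction boxes would have to come from OCR word coordinates instead, which is where my pipeline gets fragile.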
I have evaluated OCR tools such as Tesseract and PaddleOCR, as well as NLP-based NER models, for entity extraction. However, I am still facing challenges with layout variability, inconsistent OCR accuracy, multi-column alignment, and PHI embedded in headers, footers, tables, and annotations.
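For reference, the scanned-document path I have been testing looks roughly like the sketch below. The model choice, label set, and regex patterns are just my current placeholders, not anything I am confident in:

```python
# Rough sketch of my scanned-PDF path: rasterise pages, OCR with
# Tesseract, then run generic spaCy NER plus a few regexes for
# structured identifiers that NER tends to miss.
import re
import spacy
import pytesseract
from pdf2image import convert_from_path

nlp = spacy.load("en_core_web_sm")  # generic NER; not PHI-specific

STRUCTURED_PATTERNS = {
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
}

def extract_phi(pdf_path: str) -> list[tuple[int, str, str]]:
    findings = []
    for page_no, image in enumerate(convert_from_path(pdf_path, dpi=300)):
        text = pytesseract.image_to_string(image)
        # NER layer: names, locations, dates from the generic model
        for ent in nlp(text).ents:
            if ent.label_ in {"PERSON", "GPE", "DATE"}:
                findings.append((page_no, ent.label_, ent.text))
        # Regex layer: structured identifiers
        for label, pattern in STRUCTURED_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((page_no, label, match.group()))
    return findings
```

One obvious weakness of this per-page, plain-text approach is that reading order breaks on multi-column pages, which is exactly where the NER layer degrades the most.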
I am seeking advice on the following aspects:
Appropriate model combinations for robust PHI extraction across variable layouts.
Whether layout-aware architectures such as LayoutLM, Donut, or DiT are recommended for this type of redaction workflow.
Techniques for handling mixed structure documents where OCR accuracy varies across pages.
Strategies to normalise OCR output before applying NER (my current attempt is sketched after this list).
Suggested open-source pipelines for redaction that balance accuracy and reproducibility.
Recommended fine-tuning approaches for improving PHI detection across diverse PDF structures.
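Regarding the normalisation point above, this is the kind of cleanup I currently run between OCR and NER. It helps, but I suspect it is too crude, and the digit-confusion substitutions in particular are ad hoc guesses on my part:

```python
# Cleanup pass between OCR and NER: de-hyphenate words broken across
# line ends, rejoin hard-wrapped lines, fix a couple of digit/letter
# confusions inside numeric tokens, and collapse whitespace.
import re

def normalise_ocr(text: str) -> str:
    # Rejoin words hyphenated across line breaks: "identi-\nfier" -> "identifier"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace remaining newlines with spaces so sentences span OCR line wraps
    text = re.sub(r"\n+", " ", text)
    # Fix common OCR confusions, but only between digits (e.g. "98O7654" -> "9807654")
    text = re.sub(r"(?<=\d)[OolI](?=\d)",
                  lambda m: {"O": "0", "o": "0", "l": "1", "I": "1"}[m.group()],
                  text)
    # Collapse repeated whitespace
    return re.sub(r"\s{2,}", " ", text).strip()

print(normalise_ocr("Patient John Do-\ne, MRN 98O7654, seen on 03/05/2021"))
```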
I would greatly appreciate any architectural suggestions, model recommendations, or workflow strategies from the community.