I am working with scanned PDFs (not native PDFs) and need to extract, detect, or differentiate diagrams, figures, and tables from the scanned pages. Since these PDFs contain images instead of selectable text, traditional PDF parsing libraries like PyMuPDF or pdfplumber are not effective.
I have explored OCR-based approaches using Tesseract and Azure OCR, but they primarily extract text and do not specifically distinguish between diagrams, tables, or figures.
What I Need:
A way to detect and extract diagrams, figures, or tables as separate entities from scanned PDFs.
Any open-source Python libraries or deep-learning models that can classify or segment these elements from scanned pages.
If possible, an approach to separate text-based content from visual elements.
There is a lot of demand for analyzing scanned PDFs and page images, but there is no single definitive solution.
Since Tesseract alone is not suitable here, it is unlikely that any one model will cover everything; in practice you will probably need to process the pages in stages, combining a model with a general-purpose image-processing library such as OpenCV.
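As a rough illustration of the staged idea, the sketch below rasterizes a scanned page and uses plain OpenCV morphology to pull out large ink blobs as candidate figure/table regions, leaving small scattered blobs (text lines) behind. The file name, DPI, kernel size, and the "5% of the page" area threshold are assumptions you would tune for your documents.

```python
import cv2
import numpy as np
from pdf2image import convert_from_path  # requires poppler

# Rasterize the first page of the scanned PDF (path and DPI are assumptions)
page = convert_from_path("scan.pdf", dpi=300)[0]
img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)

# Binarize and invert so ink is white on black
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Merge nearby ink into blobs; large blobs are candidate figures/tables
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 25))
blobs = cv2.dilate(bw, kernel, iterations=1)

contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Heuristic: keep only regions larger than ~5% of the page area
    if w * h > 0.05 * img.shape[0] * img.shape[1]:
        cv2.imwrite(f"region_{x}_{y}.png", img[y:y + h, x:x + w])
```

This will not tell a table from a photo, but it is often enough to separate "visual" regions from running text before a more specific model or OCR pass.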
I think there are three ways to try this: use Tesseract's layout-analysis capability, which goes beyond plain OCR; use an open-source layout-analysis model; or have a reasonably capable VLM summarize the page directly. Rough sketches of the first two approaches follow.
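For the first approach, Tesseract's layout analyzer is reachable from Python via tesserocr, which can classify page blocks (flowing text, table, image, and so on) without running full text recognition. This is a minimal sketch under the assumption that tesserocr and a Tesseract install are available; the exact block-type enum values and page-segmentation mode may need adjusting for your versions, and the input file name is hypothetical.

```python
from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, RIL, PT  # PT = PolyBlockType

img = Image.open("page_1.png")  # a rasterized page from the scanned PDF

with PyTessBaseAPI(psm=PSM.AUTO) as api:
    api.SetImage(img)
    it = api.AnalyseLayout()  # layout analysis only, no OCR
    if it is not None:
        while True:
            btype = it.BlockType()            # e.g. PT.FLOWING_TEXT, PT.TABLE, PT.FLOWING_IMAGE
            bbox = it.BoundingBox(RIL.BLOCK)  # (x1, y1, x2, y2) or None
            if btype in (PT.TABLE, PT.FLOWING_IMAGE, PT.HEADING_IMAGE, PT.PULLOUT_IMAGE):
                print("visual block:", btype, bbox)
            if not it.Next(RIL.BLOCK):
                break
```

Tesseract's block typing is fairly crude on noisy scans, so treat its table/image labels as candidates rather than ground truth.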
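For the second approach, layoutparser wraps detection models trained on PubLayNet that label regions as Text, Title, List, Table, or Figure, which maps directly onto your requirement. A sketch, assuming the Detectron2 backend is installed and using hypothetical file names; the 0.8 score threshold is an assumption to tune:

```python
import cv2
import layoutparser as lp

image = cv2.imread("page_1.png")[..., ::-1]  # BGR -> RGB

# PubLayNet-trained Faster R-CNN; weights are downloaded on first use
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(image)
for i, block in enumerate(b for b in layout if b.type in ("Table", "Figure")):
    x1, y1, x2, y2 = map(int, block.coordinates)
    crop = image[y1:y2, x1:x2]
    cv2.imwrite(f"{block.type.lower()}_{i}.png", crop[..., ::-1])  # back to BGR for saving
```

The cropped Table/Figure regions give you the visual elements as separate entities, and the remaining Text/Title/List regions can be sent to Tesseract or Azure OCR, which covers your text-versus-visual separation. For the third approach (a VLM), there is no single standard API, so I have not sketched it here.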