I am working with scanned PDFs (not native PDFs) and need to extract, detect, or differentiate diagrams, figures, and tables from the scanned pages. Since these PDFs contain images instead of selectable text, traditional PDF parsing libraries like PyMuPDF or pdfplumber are not effective.
I have explored OCR-based approaches using Tesseract and Azure OCR, but they primarily extract text and do not specifically distinguish between diagrams, tables, or figures.
What I Need:
A way to detect and extract diagrams, figures, or tables as separate entities from scanned PDFs.
Any open-source Python libraries or deep-learning models that can classify or segment these elements from scanned pages.
If possible, an approach to separate text-based content from visual elements.
There is a lot of demand for analyzing scanned PDFs and page images, but there is no single definitive solution.
Since Tesseract alone is not suitable here, it is unlikely that any one model will cover everything; in practice you will probably need to process the pages in stages, combining a model with a general-purpose image-processing library such as OpenCV.
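As a rough illustration of the staged idea, the sketch below rasterizes a scanned page and uses plain OpenCV morphology to pull out large ink blobs as candidate figure/table regions, leaving small scattered blobs (text lines) behind. The file name, DPI, kernel size, and the "5% of the page" area threshold are assumptions you would tune for your documents.

```python
import cv2
import numpy as np
from pdf2image import convert_from_path  # requires poppler

# Rasterize the first page of the scanned PDF (path and DPI are assumptions)
page = convert_from_path("scan.pdf", dpi=300)[0]
img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)

# Binarize and invert so ink is white on black
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Merge nearby ink into blobs; large blobs are candidate figures/tables
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 25))
blobs = cv2.dilate(bw, kernel, iterations=1)

contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Heuristic: keep only regions larger than ~5% of the page area
    if w * h > 0.05 * img.shape[0] * img.shape[1]:
        cv2.imwrite(f"region_{x}_{y}.png", img[y:y + h, x:x + w])
```

This will not tell a table from a photo, but it is often enough to separate "visual" regions from running text before a more specific model or OCR pass.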
I think there are three ways to try this: use Tesseract's layout-analysis capability, which goes beyond plain OCR; use an open-source layout-analysis model; or have a reasonably capable VLM summarize the page directly. Rough sketches of the first two approaches follow.
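For the first approach, Tesseract's layout analyzer is reachable from Python via tesserocr, which can classify page blocks (flowing text, table, image, and so on) without running full text recognition. This is a minimal sketch under the assumption that tesserocr and a Tesseract install are available; the exact block-type enum values and page-segmentation mode may need adjusting for your versions, and the input file name is hypothetical.

```python
from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, RIL, PT  # PT = PolyBlockType

img = Image.open("page_1.png")  # a rasterized page from the scanned PDF

with PyTessBaseAPI(psm=PSM.AUTO) as api:
    api.SetImage(img)
    it = api.AnalyseLayout()  # layout analysis only, no OCR
    if it is not None:
        while True:
            btype = it.BlockType()            # e.g. PT.FLOWING_TEXT, PT.TABLE, PT.FLOWING_IMAGE
            bbox = it.BoundingBox(RIL.BLOCK)  # (x1, y1, x2, y2) or None
            if btype in (PT.TABLE, PT.FLOWING_IMAGE, PT.HEADING_IMAGE, PT.PULLOUT_IMAGE):
                print("visual block:", btype, bbox)
            if not it.Next(RIL.BLOCK):
                break
```

Tesseract's block typing is fairly crude on noisy scans, so treat its table/image labels as candidates rather than ground truth.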
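For the second approach, layoutparser wraps detection models trained on PubLayNet that label regions as Text, Title, List, Table, or Figure, which maps directly onto your requirement. A sketch, assuming the Detectron2 backend is installed and using hypothetical file names; the 0.8 score threshold is an assumption to tune:

```python
import cv2
import layoutparser as lp

image = cv2.imread("page_1.png")[..., ::-1]  # BGR -> RGB

# PubLayNet-trained Faster R-CNN; weights are downloaded on first use
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(image)
for i, block in enumerate(b for b in layout if b.type in ("Table", "Figure")):
    x1, y1, x2, y2 = map(int, block.coordinates)
    crop = image[y1:y2, x1:x2]
    cv2.imwrite(f"{block.type.lower()}_{i}.png", crop[..., ::-1])  # back to BGR for saving
```

The cropped Table/Figure regions give you the visual elements as separate entities, and the remaining Text/Title/List regions can be sent to Tesseract or Azure OCR, which covers your text-versus-visual separation. For the third approach (a VLM), there is no single standard API, so I have not sketched it here.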