Looking for a model for text extraction from complex background

AngelMuerte · July 6, 2023, 2:21pm

Hi! I’m looking for a model which can accomplish the following:
1- Analyze or parse a PDF file which contains a single-layer bitmap image (scanned) of a highly illustrated magazine or book. The text is generally written in two columns (but not always). There are often sidebars with information such as a description of a picture, or a table. Text is often printed over a colorful background, and it often wraps around artwork or illustrations.
2- The image being processed does not have to be in PDF format. I can easily convert the PDF pages into individual JPG images (or another image format) for processing.
3- The overall body of the text is generally written in chapters which are segmented by chapter title headers. Often the body of the text will contain subparagraphs which are numbered or denoted by a LETTER-number label (e.g. A-1). I would like the process to be able to identify these paragraphs and/or chapters and, upon output, denote or segment them. The purpose is to be able to take the final OCR’d text and convert it to XML. Having the paragraphs and chapters separated will allow an easy transfer to XML by chapter or paragraph, rather than having to go through a giant block of text and manually separate the chapters.
4- Last, OCR as accurately as possible for minimal manual editing. I have the ability to preprocess images from PDF to JPG, binarize, grayscale, etc.
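
To make the goal in point 3 concrete, here is a minimal sketch of the segmentation step only, assuming the OCR stage has already produced plain text in reading order. The chapter header pattern ("Chapter N: Title"), the element names (`book`, `chapter`, `paragraph`), and the function name `text_to_xml` are all hypothetical and chosen for illustration; real scans will likely need different patterns.

```python
import re
import xml.etree.ElementTree as ET

# Assumed label format: one capital letter, a hyphen, a number (e.g. "A-1").
LABEL_RE = re.compile(r"^([A-Z]-\d+)[.:]?\s+(.*)$")
# Assumed chapter header format: "Chapter 3: Some Title" (hypothetical).
CHAPTER_RE = re.compile(r"^chapter\s+(\d+)[.:]?\s*(.*)$", re.IGNORECASE)


def text_to_xml(ocr_text: str) -> ET.Element:
    """Split OCR'd text into <chapter> and <paragraph> elements."""
    root = ET.Element("book")
    chapter = None
    paragraph = None
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            paragraph = None  # a blank line ends the current paragraph
            continue
        match = CHAPTER_RE.match(line)
        if match:
            chapter = ET.SubElement(
                root, "chapter", number=match.group(1), title=match.group(2)
            )
            paragraph = None
            continue
        if chapter is None:
            # Text before the first header goes into a placeholder chapter.
            chapter = ET.SubElement(root, "chapter", number="0", title="")
        match = LABEL_RE.match(line)
        if match:
            # A labelled line always starts a new paragraph.
            paragraph = ET.SubElement(chapter, "paragraph", label=match.group(1))
            paragraph.text = match.group(2)
        elif paragraph is None:
            paragraph = ET.SubElement(chapter, "paragraph")
            paragraph.text = line
        else:
            # Continuation line: join it to the paragraph in progress.
            paragraph.text += " " + line
    return root
```

Calling `ET.tostring(text_to_xml(text), encoding="unicode")` would then give the XML as a string. This does nothing about column order, sidebars, or text over artwork; it only shows the kind of output structure being asked for.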

Thank you for any help in pointing me in the right direction! This would be HUGE if I can find a solution!

mdietterle · April 22, 2024, 5:00pm

Hi,

Did you manage to advance on this item? I’m trying a similar solution, and I haven’t been able to make any progress.
