Hi! I’m looking for a model which can accomplish the following:
1- Analyze or parse a PDF file which contains a single-layer bitmap image (a scan) of a highly illustrated magazine or book. The text is generally written in two columns (but not always). There are often sidebars with information such as a description of a picture, or a table. Text is often set on a colorful background, and it often wraps around artwork or illustrations.
2- The image being processed does not have to be in PDF format. I can easily convert the PDF pages into individual JPG images (or another image format) for processing.
3- The overall body of the text is generally written in chapters which are segmented by chapter title headers. Often the body of the text will contain subparagraphs which are numbered or denoted by a LETTER-number label (e.g. A-1). I would like the process to be able to identify these paragraphs and/or chapters and, upon output, denote or segment them. The purpose is to be able to take the final OCR’d text and convert it to XML. Having the paragraphs and chapters separated will allow an easy transfer to XML by chapter or paragraph, rather than having to go through a giant block of text and manually separate the chapters.
4- Last, OCR as accurately as possible for minimal manual editing. I have the ability to preprocess images from PDF to JPG, binarize, grayscale, etc.
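To make point 3 more concrete, here is a rough Python sketch of the kind of output I have in mind. It assumes the OCR step has already produced plain text in the correct reading order, which is the part I still need a model for. The function name, the tag names (book, chapter, paragraph) and both patterns are placeholders I made up for illustration: it assumes chapter headers start with the word "Chapter" and that labeled paragraphs start with something like "A-1", which will not hold for every book.

```python
import re
import xml.etree.ElementTree as ET

# Assumed label pattern: capital letter, hyphen, number (e.g. "A-1" or "B-12.").
LABEL_RE = re.compile(r"^([A-Z]-\d+)\b[.:]?\s*(.*)$")
# Assumed chapter header pattern: any line that starts with "Chapter".
CHAPTER_RE = re.compile(r"^chapter\b.*$", re.IGNORECASE)


def ocr_text_to_xml(text):
    """Split already-OCR'd plain text into <chapter>/<paragraph> elements."""
    root = ET.Element("book")
    chapter = None
    paragraph = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current paragraph.
            paragraph = None
            continue
        if CHAPTER_RE.match(line):
            chapter = ET.SubElement(root, "chapter", title=line)
            paragraph = None
            continue
        if chapter is None:
            # Text before the first header goes into an untitled chapter.
            chapter = ET.SubElement(root, "chapter", title="")
        match = LABEL_RE.match(line)
        if match:
            # A labeled line such as "A-1 ..." always starts a new paragraph.
            paragraph = ET.SubElement(chapter, "paragraph", label=match.group(1))
            paragraph.text = match.group(2)
        elif paragraph is None:
            paragraph = ET.SubElement(chapter, "paragraph")
            paragraph.text = line
        else:
            # Continuation line: append to the paragraph in progress.
            paragraph.text = (paragraph.text + " " + line).strip()
    return root
```

Calling ET.tostring(ocr_text_to_xml(page_text), encoding="unicode") would then give one XML string per page or per book, where page_text is whatever the OCR step returns.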
Thank you for any help in pointing me in the right direction! This would be HUGE if I can find a solution!