Hi everyone,
I’m a student with a large collection of PDF books and articles to process. I’d like to write code that automatically splits the text into distinct paragraphs based on their indentation. Here’s my challenge:
Indentation Issue: Simple PDF text extraction doesn’t reliably capture the paragraph structure.
OCR Requirement: I believe I’ll need to use Optical Character Recognition (OCR) to accurately identify paragraph breaks.
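To make the goal concrete, the grouping step might look roughly like the sketch below. It assumes each line has already been extracted together with the x-coordinate of its left edge (for example from an OCR or PDF layout tool, which is not shown here). The names `split_paragraphs` and `indent_threshold` and the sample coordinates are hypothetical, chosen only for illustration. It also assumes that most lines are not indented, so the median left edge approximates the page margin.

```python
from statistics import median


def split_paragraphs(lines, indent_threshold=10.0):
    """Group (x_left, text) lines into paragraphs by first-line indentation.

    Assumption: most lines start at the left margin, so the median x_left
    approximates that margin. A line starting more than `indent_threshold`
    units to the right of it is treated as the start of a new paragraph.
    """
    if not lines:
        return []

    margin = median(x for x, _ in lines)
    paragraphs = []
    current = []

    for x, text in lines:
        is_indented = (x - margin) > indent_threshold
        # An indented line closes the paragraph collected so far.
        if is_indented and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text.strip())

    # Flush whatever is left after the last line.
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical sample: x-coordinates in points, 72 = margin, 90 = indented.
sample = [
    (90.0, "First paragraph starts here"),
    (72.0, "and continues on this line,"),
    (72.0, "then ends on this one."),
    (90.0, "Second paragraph starts here"),
    (72.0, "and ends here."),
]

for paragraph in split_paragraphs(sample):
    print(paragraph)
```

The threshold would need tuning per document, and this sketch ignores multi-column layouts, headings, and block quotes, which is part of why I’m asking whether an existing tool already handles this.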
Questions:
Are there existing models or libraries that can handle this type of OCR-based paragraph separation?
If I need to build something myself, could you give me some guidance on how to start? I’m open to suggestions in any programming language.
Thank you for your help!