Need Help Separating PDF Content into Paragraphs Using OCR

Hi everyone,

I’m a student with a large number of PDF books and articles to process. I’d like to write code that automatically splits the content into separate paragraphs based on their indentation. Here’s my challenge:

Indentation Issue: Simple PDF text extraction doesn’t reliably capture the paragraph structure.
OCR Requirement: I believe I’ll need to use Optical Character Recognition (OCR) to accurately identify paragraph breaks.
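
To make the goal concrete, here is a minimal sketch of the kind of indentation-based splitting I have in mind. It assumes each text line already comes with the x-coordinate of its left edge (from an OCR or layout-extraction step I have not settled on yet). The `Line` type, the `split_paragraphs` function, and the `indent_threshold` value are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    x0: float  # left edge of the line in page units (assumed to come from OCR/layout output)


def split_paragraphs(lines, indent_threshold=10.0):
    """Group lines into paragraphs: a line indented past the usual left margin starts a new one."""
    if not lines:
        return []
    # Estimate the normal left margin as the median left edge,
    # assuming most lines on the page are not indented.
    margin = median(line.x0 for line in lines)
    paragraphs, current = [], []
    for line in lines:
        # An indented line closes the paragraph collected so far.
        if current and line.x0 - margin > indent_threshold:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text)
    paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    Line("First paragraph starts here", 70.0),
    Line("and continues on this line.", 50.0),
    Line("Second paragraph starts here", 70.0),
    Line("and ends here.", 50.0),
    Line("It has one more line.", 50.0),
]
for paragraph in split_paragraphs(sample):
    print(paragraph)
```

This only handles a single column with a consistent margin; the threshold would likely need tuning per document, and the median-based margin is just one simple guess at how to find the baseline indentation.
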
Questions:

Are there existing models or libraries that can handle this type of OCR-based paragraph separation?

If I need to build something, could you provide guidance on how to start? I’m open to any programming language suggestions.

Thank you for your help!
