Need Help Separating PDF Content into Paragraphs Using OCR

Yodem · March 14, 2024, 8:03am

Hi everyone,

I’m a student with a large volume of PDF books and articles to process. I’d like to develop a code solution to automatically separate the content into distinct paragraphs based on their indentation. Here’s my challenge:

Indentation Issue: Simple PDF text extraction doesn’t reliably capture the paragraph structure.
OCR Requirement: I believe I’ll need to use Optical Character Recognition (OCR) to accurately identify paragraph breaks.
Questions:

Are there existing models or libraries that can handle this type of OCR-based paragraph separation?

If I need to build something myself, could you provide guidance on how to start? I’m open to suggestions in any programming language.
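To make the goal concrete, here is a minimal sketch of indentation-based splitting in plain Python. It assumes each text line is already available together with the x-coordinate of its left edge, for example from a PDF extractor or an OCR engine that reports bounding boxes. The function name `split_paragraphs`, the `min_indent` threshold, and the sample coordinates are all hypothetical, and the sketch does not handle multi-column layouts, centered headings, or words hyphenated across lines.

```python
from statistics import median


def split_paragraphs(lines, min_indent=5.0):
    """Group (x0, text) lines into paragraphs using first-line indentation.

    lines: list of (x0, text) tuples in reading order, where x0 is the
           left edge of the line (any unit, as long as it is consistent).
    min_indent: hypothetical threshold; a line starting at least this far
                to the right of the usual margin opens a new paragraph.
    """
    if not lines:
        return []

    # Assumption: most lines are not indented, so the median left edge
    # is a reasonable estimate of the body-text margin.
    margin = median(x0 for x0, _ in lines)

    paragraphs = []
    current = []
    for x0, text in lines:
        is_indented = (x0 - margin) >= min_indent
        # An indented line closes the paragraph collected so far.
        if is_indented and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text.strip())

    # Flush the last paragraph.
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Made-up sample data: two paragraphs, each with an indented first line.
sample_lines = [
    (90.0, "First paragraph starts here"),
    (72.0, "and continues on this line."),
    (90.0, "Second paragraph starts here"),
    (72.0, "and wraps once"),
    (72.0, "before ending."),
]

for paragraph in split_paragraphs(sample_lines):
    print(paragraph)
```

The median is used for the margin rather than the minimum so that a single stray line sitting left of the body text does not make every other line look indented. A suitable threshold would likely have to be tuned per document.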

Thank you for your help!
