How to do full page analysis with TrOCR (integrating with text segmentation analysis)

jhhf · May 10, 2023, 4:16pm

Hello @nielsr, I am an absolute beginner looking to OCR/HTR both printed and handwritten Hebrew and Yiddish letters.

On GitHub (TrOCR) you mention: “You still need a separate text detection model to get all single-line texts from a PDF.”

Would you please point me to sample Jupyter notebooks/code that will help me understand:

a) How do I take PAGE XML (or other) bounding-box output and send it to TrOCR?

b) What other methods are there to segment a page into lines and then send those lines to TrOCR?
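To make question (a) concrete, here is a minimal sketch of the first half of that pipeline: reading line bounding boxes out of a PAGE XML file. It assumes the layout tool writes `TextLine` elements whose `Coords` child has a `points="x,y x,y ..."` attribute (some older PAGE versions use `Point` child elements instead, which this does not handle). The function name `line_boxes_from_page_xml` and the sample XML are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def line_boxes_from_page_xml(xml_text):
    """Return [(line_id, (left, top, right, bottom)), ...] for each TextLine.

    Assumption: each TextLine has a direct Coords child with a 'points'
    attribute. Polygons are reduced to their axis-aligned bounding box.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only look at direct children, so Word-level Coords are ignored.
        coords = next((c for c in elem if local_name(c.tag) == "Coords"), None)
        if coords is None or not coords.get("points"):
            continue
        points = [
            tuple(int(float(v)) for v in pair.split(","))
            for pair in coords.get("points").split()
        ]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        boxes.append((elem.get("id"), (min(xs), min(ys), max(xs), max(ys))))
    return boxes


# Hypothetical two-line page, only to exercise the parser.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1400">
    <TextRegion id="r1">
      <Coords points="90,90 910,90 910,300 90,300"/>
      <TextLine id="l1">
        <Coords points="100,100 900,105 900,160 100,155"/>
      </TextLine>
      <TextLine id="l2">
        <Coords points="100,180 880,182 880,240 100,238"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

boxes = line_boxes_from_page_xml(SAMPLE)
for line_id, box in boxes:
    print(line_id, box)
```

Each `(left, top, right, bottom)` tuple is in the order that Pillow's `Image.crop` expects, so one plausible next step is to crop the page image once per box and pass each single-line crop to TrOCR through `TrOCRProcessor` and `VisionEncoderDecoderModel.generate`. That second half is left out here because it depends on which checkpoint is used.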

If one is looking to OCR/HTR full pages, should one even consider TrOCR? You posted some notebooks for LayoutLMv2 and LayoutLMv3. Would these be better tools to use? You also suggested Eynollah for text segmentation.

I can see the value in training or fine-tuning the text recognition part on Hebrew and Yiddish models that already exist (e.g. avichr/heBERT).

Thank you (and others in this community) for any assistance. I am a novice, so code samples/explanations are very helpful.
