How to do full page analysis with TrOCR (integrating with text segmentation analysis)

Hello @nielsr, I am an absolute beginner looking to OCR/HTR both printed and hand-written Hebrew and Yiddish letters.

In (GitHub TrOCR) you mention “You still need a separate text detection model to get all single-line texts from a PDF.”

Would you please point me to sample Jupyter notebooks/code that will help me understand:

a) how to take Page XML (or other) bounding-box output to send to TrOCR?

b) other methods to line segment a page to then send that information to TrOCR?
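
To make (a) concrete, here is my rough understanding of what such a pipeline might look like. It is only a minimal sketch under several assumptions: the PAGE XML stores each line as a `TextLine` element with a `Coords points="x1,y1 x2,y2 ..."` attribute (older PAGE versions use `Point` child elements instead), the helper names `line_boxes_from_page_xml` and `recognize_lines` are my own, and `microsoft/trocr-base-handwritten` is just a placeholder checkpoint. As far as I know that checkpoint is trained on English, so Hebrew or Yiddish would need a fine-tuned model.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def line_boxes_from_page_xml(xml_text):
    """Return one (left, top, right, bottom) box per TextLine, in document order.

    Assumes the Coords element carries a 'points' attribute ("x,y x,y ...").
    Matching on local names avoids hard-coding a PAGE schema version.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct children so Word/Glyph Coords are ignored.
        coords = next((c for c in elem if local_name(c.tag) == "Coords"), None)
        if coords is None or not coords.get("points"):
            continue
        pts = [tuple(int(float(v)) for v in p.split(","))
               for p in coords.get("points").split()]
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        # Axis-aligned bounding box around the line polygon.
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def recognize_lines(image_path, boxes, checkpoint="microsoft/trocr-base-handwritten"):
    """Crop each box from the page image and run TrOCR on the crop.

    The checkpoint name is a placeholder assumption; swap in a model
    fine-tuned for the target script.
    """
    # Imported here so the XML parsing above works without these packages.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    page = Image.open(image_path).convert("RGB")

    texts = []
    for box in boxes:
        line_image = page.crop(box)  # box is (left, top, right, bottom)
        pixel_values = processor(images=line_image, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.append(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
    return texts
```

If this is roughly right, I would call `line_boxes_from_page_xml(open("page.xml", encoding="utf-8").read())` and pass the result to `recognize_lines("page.png", boxes)`, where both file names are hypothetical. I am unsure whether cropping a plain rectangle is good enough for slanted handwriting, or whether the line polygon should be used as a mask.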

If one is looking to OCR/HTR full pages, should one even consider TrOCR? You posted some notebooks for LayoutLMv2 and LayoutLMv3. Would these be better tools to use? You also suggested Eynollah for text segmentation.

I can see the value in training or fine-tuning the text recognition part on Hebrew and Yiddish models that already exist (e.g. avichr/heBERT).

Thank you (and others in this community) for any assistance. I am a novice, so code samples/explanations are very helpful.
