DONUT: Reading order for pseudo-OCR pre-training task

muibk · January 16, 2025, 11:20am

I would like to train the Donut base model for a few more epochs on the pre-training pseudo-OCR task using a custom dataset. In what reading order should the individual words of the document image be passed to the model? The Donut paper states:

The model is trained to read all texts in the image in reading order (from top-left to bottom-right, basically). […] This task can be interpreted as a pseudo-OCR task.

What does “top-left to bottom-right” mean for multi-column text? For instance, consider the attached dummy document with one heading and two text columns:

Should the document be transcribed as:

Word1 Col1w1 Col1w2 Col2w1 Col2w2, or
Word1 Col1w1 Col2w1 Col1w2 Col2w2 ?
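
To make the two candidates concrete, here is a minimal sketch of how each ordering could be derived from word positions. It is not taken from the Donut code base. The coordinates, the `Word` tuple, the function names, and the `line_tol`, `column_split_x`, and `heading_max_y` parameters are all assumptions made up for illustration, and the layout assumes that each column holds one word per line.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word box (hypothetical units)
    y: float  # top edge of the word box (hypothetical units)


# Assumed layout of the dummy document: a heading above two columns,
# with one word per line in each column.
WORDS = [
    Word("Col2w2", 300, 130),
    Word("Col1w1", 50, 100),
    Word("Word1", 50, 10),
    Word("Col2w1", 300, 100),
    Word("Col1w2", 50, 130),
]


def raster_order(words, line_tol=5.0):
    """Read line by line across the full page width, ignoring columns."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        # Words whose top edges are within line_tol share a text line.
        if lines and abs(w.y - lines[-1][0].y) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [w.text for line in lines for w in sorted(line, key=lambda w: w.x)]


def column_order(words, column_split_x, heading_max_y):
    """Read the heading first, then the left column, then the right column."""
    def key(w):
        return (w.y, w.x)

    heading = [w for w in words if w.y <= heading_max_y]
    body = [w for w in words if w.y > heading_max_y]
    left = [w for w in body if w.x < column_split_x]
    right = [w for w in body if w.x >= column_split_x]
    ordered = sorted(heading, key=key) + sorted(left, key=key) + sorted(right, key=key)
    return [w.text for w in ordered]


print(" ".join(column_order(WORDS, column_split_x=200, heading_max_y=50)))
print(" ".join(raster_order(WORDS)))
```

Under these assumptions, `column_order` produces the first transcription and `raster_order` the second, so the question is which of the two policies (if either) matches the data the base model was pre-trained on.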

I imagine that any dataset used for the pre-training pseudo-OCR task should adopt the same reading-order policy as the pre-trained Donut base model. Unfortunately, I cannot find any information on the exact implementation of “top-left to bottom-right” in the paper, the paper supplement, or the source code.
