Hi Folks,
Thank you for offering the Donut model through Hugging Face!
I followed the great blog post from Philipp (Document AI: Fine-tuning Donut for document-parsing using Hugging Face Transformers (philschmid.de)) for the fine-tuning.
However, I would like to train Donut on a new language before fine-tuning it. From what I have read, there is little difference between Donut's initial pre-training and the fine-tuning. Still, I wonder why the official repo so far only documents training via the command line:

`python train.py --config config/base.yaml --exp_version "base"`
My question: can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. the task of predicting the words on an image) with the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?
If this is possible, what do I need to change? Some code snippets or hints would be super helpful!
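For context, this is roughly how I load the dataset at the moment, following the imagefolder layout from the blog post (just a sketch; the `data_dir` path is a placeholder):

```python
import json
from datasets import load_dataset

# images plus a metadata.jsonl with "file_name" and "ground_truth" columns
dataset = load_dataset("imagefolder", data_dir="path/to/dataset", split="train")

def add_target_sequence(sample):
    # for the reading task, the target is just the raw text on the image
    gt = json.loads(sample["ground_truth"])
    sample["target_sequence"] = gt["gt_parse"]["text_sequence"]
    return sample

dataset = dataset.map(add_target_sequence)
```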
@amyeroberts, Philipp told me you might be able to help with this question.
Hi @ChrisDelClea, thanks for your question! I'm not super familiar with the fine-tuning code, but I can try and help!
It's certainly possible to fine-tune the Donut model on a new language. Regarding the question:
> can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. the task of predicting the words on an image) with the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?
are you asking what changes are required to take a dataset in the format shown in the official repo and use it with the code from the fine-tuning blog post?
In addition to getting the code to run with a new dataset, there might be other considerations too. For example, the tokenizer used might have had its vocabulary built from English inputs only, in which case text in the new language could tokenize poorly.
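A quick way to get a feel for that (just a sketch, not a rigorous test) is to tokenize some text in the target language with the pretrained processor and look at the pieces:

```python
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

text = "Grüße aus München, Vereinbarung über die Lieferung"
tokens = processor.tokenizer.tokenize(text)
print(tokens)

# lots of single-character pieces or unknown tokens suggest poor coverage
print("unk count:", tokens.count(processor.tokenizer.unk_token))
```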
Hi @amyeroberts!
Yes, exactly that dataset format for the reading task:
#### For (Pseudo) Text Reading Task
The `gt_parse` looks like `{"text_sequence" : "word1 word2 word3 ... "}`
- This task is also a pre-training task of the Donut model.
How would I have to change the code from the blog post to train on that format? My rough guess is below.
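Just a sketch of what I mean: the target would be the raw `text_sequence` wrapped in a task start token and the EOS token, replacing the JSON-to-token preprocessing from the blog post (the token name `<s_pseudo_read>` is my own placeholder):

```python
import json

task_start_token = "<s_pseudo_read>"  # placeholder name, my own choice
eos_token = "</s>"

def build_target_sequence(ground_truth: str) -> str:
    # the flat reading-task format has no nested keys to serialize
    text = json.loads(ground_truth)["gt_parse"]["text_sequence"]
    return task_start_token + text + eos_token

# the new task token would also have to be registered with the tokenizer:
# processor.tokenizer.add_special_tokens({"additional_special_tokens": [task_start_token]})
# model.decoder.resize_token_embeddings(len(processor.tokenizer))
```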
Regarding the tokenizer: is there a way to swap the tokenizer used inside `DonutProcessor` in my fine-tuning code for one that covers the new language?
I want to replace the following tokenizer with a German one:
donut/model.py at e6623ad56c0e9f12a426dab2d8b2d65a39d64689 · clovaai/donut (github.com)
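Something like this is what I had in mind (just a sketch; the German tokenizer checkpoint name is a placeholder, and I'm not sure all of these steps are needed):

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel, XLMRobertaTokenizerFast

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# swap the tokenizer inside the processor for a German one (placeholder name)
german_tokenizer = XLMRobertaTokenizerFast.from_pretrained("my-org/german-tokenizer")
processor.tokenizer = german_tokenizer

# resize the decoder embeddings to the new vocabulary; the new rows start
# untrained, which is presumably why a pre-training pass on the new language is needed
model.decoder.resize_token_embeddings(len(german_tokenizer))
model.config.decoder.vocab_size = len(german_tokenizer)
```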
Does that make sense, or is it not necessary at all? Sorry for the silly questions; it would help me a lot if you could tell me!