Pre-Training Donut on a New Language

Hi Folks,

thank you for offering the Donut model through Hugging Face!

I followed Philipp’s great blog post (Document AI: Fine-tuning Donut for document-parsing using Hugging Face Transformers (philschmid.de)) for the fine-tuning.

However, I would like to train Donut on a new language before I fine-tune the model. From what I have read, there is little difference between Donut’s initial pre-training and the fine-tuning. Still, I wonder why the official repo so far only documents training via the command line: `python train.py --config config/base.yaml --exp_version "base"`.

My question: can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. on the task of predicting the words on an image) using the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?

If this is possible, what would I need to change? Some code snippets or hints would be super helpful!

@amyeroberts, Philipp told me you might be able to help with this question.


Hi @ChrisDelClea, thanks for your question! I’m not super familiar with the fine-tuning code, but I can try and help 🙂

It’s certainly possible to fine-tune the Donut model on a new language. With regard to the question:

can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. on the task of predicting the words on an image) using the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?

are you asking what changes are required to take a dataset in the format shown in the official repo and use it with the code from the fine-tuning blog post?

In addition to getting the code to run with a new dataset, there might be other considerations too. For example, the tokenizer’s vocabulary might have been built on English inputs only, in which case words in a new language could get split into many tiny fragments or unknown tokens.
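A quick way to sanity-check this, and to extend the vocabulary if needed, might look roughly like the sketch below (untested; the German token list is just a placeholder you’d derive from your own corpus):

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Inspect how the current tokenizer splits text in the new language;
# lots of tiny fragments or <unk> tokens suggest poor coverage.
print(processor.tokenizer.tokenize("Straßenbahnhaltestelle"))

# Placeholder list -- in practice you'd collect frequent words/subwords
# from your own corpus.
new_language_tokens = ["straße", "haltestelle"]
processor.tokenizer.add_tokens(new_language_tokens)

# The decoder's embedding matrix has to grow to match the new vocabulary.
model.decoder.resize_token_embeddings(len(processor.tokenizer))
```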

Hi @amyeroberts!

Yes, exactly that dataset format for the reading task:

#### For (Pseudo) Text Reading Task
The `gt_parse` looks like `{"text_sequence" : "word1 word2 word3 ... "}`
- This task is also a pre-training task of Donut model.

How would I have to change the code from the blog post to train on that format?
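For context, here is roughly what I have in mind for the loading side (a sketch on my part; the folder layout and the `<s_pretrain>` task token are my own assumptions, not names from the repo):

```python
import json
from datasets import load_dataset

# Assumed folder layout:
#   dataset/
#     metadata.jsonl   # one {"file_name": ..., "ground_truth": ...} per line
#     image_138.jpg
#     ...
dataset = load_dataset("imagefolder", data_dir="dataset", split="train")

def add_target_sequence(sample, task_start_token="<s_pretrain>", eos_token="</s>"):
    # ground_truth is an escaped JSON string, so parse it first.
    gt_parse = json.loads(sample["ground_truth"])["gt_parse"]
    # For the text reading task the target is just the raw text sequence,
    # wrapped in a task start token and the EOS token.
    sample["target_sequence"] = task_start_token + gt_parse["text_sequence"] + eos_token
    return sample

dataset = dataset.map(add_target_sequence)
```

Is that roughly the right direction, or does more of the blog post’s preprocessing need to change?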

Regarding the tokenizer: is there a way to swap out the tokenizer used inside `DonutProcessor` in my fine-tuning code for one that covers the new language?

I want to replace the following tokenizer with a German one:

donut/model.py at e6623ad56c0e9f12a426dab2d8b2d65a39d64689 · clovaai/donut (github.com)
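Something like this is what I have in mind (a rough sketch; `xlm-roberta-base` is just one multilingual candidate I picked, any tokenizer with German coverage could go in its place):

```python
from transformers import (
    DonutProcessor,
    VisionEncoderDecoderModel,
    XLMRobertaTokenizerFast,
)

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Swap in a tokenizer that covers German.
new_tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
processor.tokenizer = new_tokenizer

# The old embedding rows no longer match the new token ids, so resize the
# decoder embeddings; they would effectively be relearned during training.
model.decoder.resize_token_embeddings(len(new_tokenizer))
model.config.decoder.vocab_size = len(new_tokenizer)
```

As far as I understand, swapping the vocabulary throws away the alignment between the pretrained decoder embeddings and the token ids, so the model would need a lot of re-training afterwards.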

Does that make sense, or is it not necessary at all? Sorry for the silly questions; it would help me a lot if you could tell me!


Hi @ChrisDelClea,
I have the same objective, which is to pre-train Donut on a new language (Arabic in my case). I don’t know if you succeeded in doing so.
If you have any instructions/code on this matter, I’ll be very happy to know.
