Hi Folks,
Thank you for offering the Donut model through Hugging Face!
I followed the great blog post from Philipp (Document AI: Fine-tuning Donut for document-parsing using Hugging Face Transformers (philschmid.de)) for the fine-tuning.
However, I would like to train Donut on a new language before fine-tuning it. From what I have read, there is little difference between Donut's initial pre-training and the fine-tuning. Still, I wonder why the official repo so far only documents training via the command line:

`python train.py --config config/base.yaml --exp_version "base"`
My question: can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. the task of predicting the words on an image) with the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?
If this is possible, what do I need to change? Some code snippets or hints would be super helpful!
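For context, this is roughly how I load the dataset at the moment, following the imagefolder layout from the blog post (just a sketch; the `data_dir` path is a placeholder):

```python
import json
from datasets import load_dataset

# images plus a metadata.jsonl with "file_name" and "ground_truth" columns
dataset = load_dataset("imagefolder", data_dir="path/to/dataset", split="train")

def add_target_sequence(sample):
    # for the reading task, the target is just the raw text on the image
    gt = json.loads(sample["ground_truth"])
    sample["target_sequence"] = gt["gt_parse"]["text_sequence"]
    return sample

dataset = dataset.map(add_target_sequence)
```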
@amyeroberts, Philipp told me you might be able to help with this question.
Hi @ChrisDelClea, thanks for your question! I'm not super familiar with the fine-tuning code, but I can try and help!
It's certainly possible to fine-tune the Donut model on a new language. Regarding the question:
> can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. the task of predicting the words on an image) with the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?
are you asking what changes are required to take a dataset in the format shown in the official repo and use it with the code from the fine-tuning blog post?
In addition to getting the code to run with a new dataset, there might be other considerations too. For example, the tokenizer used might have had its vocabulary built from English inputs only, in which case text in the new language could tokenize poorly.
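A quick way to get a feel for that (just a sketch, not a rigorous test) is to tokenize some text in the target language with the pretrained processor and look at the pieces:

```python
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

text = "Grüße aus München, Vereinbarung über die Lieferung"
tokens = processor.tokenizer.tokenize(text)
print(tokens)

# lots of single-character pieces or unknown tokens suggest poor coverage
print("unk count:", tokens.count(processor.tokenizer.unk_token))
```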
Hi @amyeroberts!
Yes, exactly that dataset format for the reading task:
#### For (Pseudo) Text Reading Task
The `gt_parse` looks like `{"text_sequence" : "word1 word2 word3 ... "}`
- This task is also a pre-training task of the Donut model.
How would I have to change the code from the blog post to train on that format? My rough guess is below.
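Just a sketch of what I mean: the target would be the raw `text_sequence` wrapped in a task start token and the EOS token, replacing the JSON-to-token preprocessing from the blog post (the token name `<s_pseudo_read>` is my own placeholder):

```python
import json

task_start_token = "<s_pseudo_read>"  # placeholder name, my own choice
eos_token = "</s>"

def build_target_sequence(ground_truth: str) -> str:
    # the flat reading-task format has no nested keys to serialize
    text = json.loads(ground_truth)["gt_parse"]["text_sequence"]
    return task_start_token + text + eos_token

# the new task token would also have to be registered with the tokenizer:
# processor.tokenizer.add_special_tokens({"additional_special_tokens": [task_start_token]})
# model.decoder.resize_token_embeddings(len(processor.tokenizer))
```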
Regarding the tokenizer: is there a way to swap the tokenizer used inside `DonutProcessor` in my fine-tuning code for one that covers the new language?
I want to replace the following tokenizer with a German one:
donut/model.py at e6623ad56c0e9f12a426dab2d8b2d65a39d64689 · clovaai/donut (github.com)
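Something like this is what I had in mind (just a sketch; the German tokenizer checkpoint name is a placeholder, and I'm not sure all of these steps are needed):

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel, XLMRobertaTokenizerFast

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# swap the tokenizer inside the processor for a German one (placeholder name)
german_tokenizer = XLMRobertaTokenizerFast.from_pretrained("my-org/german-tokenizer")
processor.tokenizer = german_tokenizer

# resize the decoder embeddings to the new vocabulary; the new rows start
# untrained, which is presumably why a pre-training pass on the new language is needed
model.decoder.resize_token_embeddings(len(german_tokenizer))
model.config.decoder.vocab_size = len(german_tokenizer)
```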
Does that make sense, or is it not necessary at all? Sorry for the silly questions; it would help me a lot if you could tell me!