However, I would like to train Donut on a new language before I fine-tune the model. From what I have read, there is little difference between Donut's initial pre-training and its fine-tuning. Still, I wonder why the official repo so far only documents training via the command line: `python train.py --config config/base.yaml --exp_version "base"`.
My question: can I change the code from the blog post so that I can re-train / further fine-tune the model on an unseen language (i.e. the task of predicting the words on an image) with the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?
If this is possible, what do I need to change? Some code snippets or hints would be super helpful!
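For context, each line of that `metadata.jsonl` file is a JSON object whose `ground_truth` field is itself a JSON string, so it has to be decoded twice. A minimal sketch using only the standard library (the file name and words are just the example from above):

```python
import json

# One line of metadata.jsonl in the (pseudo) text-reading format.
line = '{"file_name": "image_138.jpg", "ground_truth": "{\\"gt_parse\\": {\\"text_sequence\\": \\"word1 word2 word3\\"}}"}'

record = json.loads(line)
# ground_truth is itself a JSON string, so decode it a second time.
ground_truth = json.loads(record["ground_truth"])
text = ground_truth["gt_parse"]["text_sequence"]

print(record["file_name"])  # image_138.jpg
print(text)                 # word1 word2 word3
```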
@amyeroberts Philipp told me you might help me with this question?
Hi @ChrisDelClea, thanks for your question! I'm not super familiar with the fine-tuning code, but I can try and help.
It's certainly possible to fine-tune the Donut model on a new language. Regarding the question:

> can I change the code from the blogpost so that I can re-train / further fine-tune the model on an unseen language (i.e. the task of predicting the words on an image) with the `{"file_name": "image_138.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"... \"}}"}` format?

are you asking what changes are required to take a dataset in the format shown in the official repo and use it with the fine-tuning blog post?
In addition to getting the code to run with a new dataset, there might be other considerations too. For example, the tokenizer used might have had its vocabulary and tokens based on only English inputs.
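One quick way to check that is to measure how often the current tokenizer falls back to its unknown token on text in the target language. A rough sketch (the helper name is mine; it only assumes a tokenizer with a `tokenize` method and an `unk_token` attribute, as the Hugging Face tokenizers have):

```python
def unk_fraction(tokenizer, text):
    """Fraction of tokens that the tokenizer maps to its unknown token.

    A high value suggests the vocabulary does not cover the language well,
    and the tokenizer (or its vocabulary) should be extended or replaced.
    """
    tokens = tokenizer.tokenize(text)
    if not tokens:
        return 0.0
    return sum(t == tokenizer.unk_token for t in tokens) / len(tokens)
```

With the real processor you would call something like `unk_fraction(processor.tokenizer, some_german_text)`; values close to 0 suggest the existing vocabulary may already be usable.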
Yes, exactly that dataset format for the reading task:
#### For (Pseudo) Text Reading Task
The `gt_parse` looks like `{"text_sequence" : "word1 word2 word3 ... "}`
- This task is also a pre-training task of the Donut model.
How would I have to change the code from the blog post to train on that format?
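For what it's worth, the main change in the blog post's dataset code would be how the decoder target sequence is built: instead of converting a nested `gt_parse` into XML-like special tokens, the text-reading task only needs a task start token, the raw `text_sequence`, and the end-of-sequence token. A hedged sketch (the default token names here are assumptions on my part; double-check them against your checkpoint's config and the official repo):

```python
import json

def build_target_sequence(ground_truth, task_start_token="<s_synthdog>", eos_token="</s>"):
    """Build the decoder target for the text-reading task.

    ground_truth is the JSON string stored in the `ground_truth`
    field of one metadata.jsonl record.
    """
    gt = json.loads(ground_truth)
    return task_start_token + gt["gt_parse"]["text_sequence"] + eos_token
```

The rest of the pipeline (tokenizing the target, masking the prompt tokens in the labels) can stay as in the blog post.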
Regarding the tokenizer: is there a way to change the tokenizer used inside `DonutProcessor` in my fine-tuning code for a new language?
I want to replace the following tokenizer with a German one:
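As a sketch of what that could look like: `DonutProcessor` bundles an image processor and a tokenizer, so in principle you can assign a different tokenizer to it and then resize the decoder's token embeddings to the new vocabulary size (`resize_token_embeddings` is the standard Hugging Face API for this). The helper name below is mine, and note the caveat in the comments:

```python
def swap_tokenizer(processor, model, new_tokenizer):
    """Replace the tokenizer inside a DonutProcessor and keep the
    decoder's embedding matrix in sync with the new vocabulary size.

    Caveat: the old embeddings will not line up with the new token ids,
    so the decoder effectively has to relearn them - which is part of
    why a real pre-training run on the new language is usually needed,
    rather than light fine-tuning.
    """
    processor.tokenizer = new_tokenizer
    model.decoder.resize_token_embeddings(len(new_tokenizer))
    return processor, model
```

With real objects this would be called with, e.g., a `DonutProcessor` loaded via `from_pretrained` and a German tokenizer of your choice (which one is best for your data is for you to pick; I'm not naming a specific checkpoint here).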
Hi @ChrisDelClea,
I have the same objective, which is to pre-train Donut on a new language (Arabic in my case). I don't know if you succeeded in doing so.
If you have any instructions/code on this matter, I'll be very happy to know.