I’m trying to train Donut to parse invoices. I’ve already trained Donut on a similar use case: I generated a dataset of JSON objects containing bank-statement info, populated a number of docx templates with the generated JSONs to render document images, and trained on the resulting synthetic (image, JSON) pairs. This worked very well, probably in large part because I further fine-tuned the naver-clova-ix/donut-base-finetuned-cord-v2 checkpoint, which is already fine-tuned on similar images.
Now I’m trying a similar approach. However, these invoices are in Greek. Simply adding the Greek alphabet as new tokens resulted in poor tokenization of the JSON labels, with many `<unk>` tokens and spaces in random places. I’ve concluded that I need to train a new tokenizer, following Training a new tokenizer from an old one - Hugging Face Course, on a corpus of Greek-populated JSON objects.
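In case it helps anyone reading along, this is roughly what retraining the tokenizer looks like, following the course. The corpus below is a tiny hypothetical stand-in for my generated Greek JSON labels, and the vocab size is an assumption you would size to a real corpus:

```python
from transformers import AutoTokenizer

# Hypothetical stand-in for the generated corpus of Greek-populated JSON labels.
corpus = [
    '{"αριθμός_τιμολογίου": "ΤΔΑ-%04d", "σύνολο": "%d,%02d €"}' % (i, i, i % 100)
    for i in range(200)
]

# Start from the base checkpoint's fast tokenizer and retrain its subword
# vocabulary on the Greek corpus; special tokens are carried over.
old_tokenizer = AutoTokenizer.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-cord-v2"
)
new_tokenizer = old_tokenizer.train_new_from_iterator(
    (corpus[i : i + 32] for i in range(0, len(corpus), 32)),
    vocab_size=1000,  # assumption; pick a size appropriate for your real corpus
)

# Greek text seen in the corpus should now tokenize without <unk> pieces.
ids = new_tokenizer('{"σύνολο": "10,00 €"}')["input_ids"]
print(new_tokenizer.unk_token_id in ids)
```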
I’m not sure what the best fine-tuning approach is here. I’m hoping to get some transfer from naver-clova-ix/donut-base-finetuned-cord-v2, but if I’m replacing the tokenizer, that means I need to train at least the decoder essentially from scratch, right? I could also use some advice on setting learning rates: is it possible to set a small learning rate for Donut’s vision encoder but a larger one for the decoder? And how much data and training time should I expect to need to train Donut this way?
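On the learning-rate part of my own question, I believe the answer is yes: a standard PyTorch optimizer accepts separate parameter groups, each with its own learning rate. A minimal sketch with a toy module standing in for the real model (a VisionEncoderDecoderModel exposes model.encoder and model.decoder the same way; the LR values here are assumptions, not recommendations):

```python
import torch
from torch import nn
from torch.optim import AdamW

# Toy module with the same encoder/decoder layout as a VisionEncoderDecoderModel.
class ToyVED(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)  # stands in for the pretrained Swin encoder
        self.decoder = nn.Linear(8, 8)  # stands in for the re-initialised decoder

model = ToyVED()

# Per-module learning rates via optimizer parameter groups:
# small LR for the pretrained encoder, larger LR for the new decoder.
optimizer = AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-5},
    {"params": model.decoder.parameters(), "lr": 1e-4},
])
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0001]
```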
Any help appreciated! Would greatly appreciate any insight from @nielsr as well.
> if I’m replacing the tokenizer, that means I need to train at least the decoder essentially from scratch, right?
That’s correct. Donut’s decoder uses an English vocabulary of tokens, so it doesn’t know any Greek tokens.
One possibility is to instantiate a VisionEncoderDecoderModel with the weights of the vision encoder of naver-clova-ix/donut-base-finetuned-cord-v2 as the encoder, and the weights of a Greek pre-trained language model as the decoder (like this one). Next, you can fine-tune this model on (image, Greek sequence) pairs.
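A sketch of that wiring, using tiny randomly initialised configs instead of the real checkpoints so it runs without downloads (in practice you would take the encoder from the CORD checkpoint and load the Greek LM with cross-attention enabled; the token ids set on the config are assumptions):

```python
import torch
from transformers import (
    BertConfig, BertLMHeadModel,
    SwinConfig, SwinModel,
    VisionEncoderDecoderModel,
)

torch.manual_seed(0)

# Tiny stand-in for Donut's Swin vision encoder.
encoder = SwinModel(SwinConfig(
    image_size=64, patch_size=4, embed_dim=24,
    depths=[2, 2], num_heads=[2, 4], window_size=4,
))

# Tiny stand-in for a Greek pre-trained decoder; cross-attention is
# required so it can attend to the image features.
decoder = BertLMHeadModel(BertConfig(
    vocab_size=1000, hidden_size=96, num_hidden_layers=2,
    num_attention_heads=4, intermediate_size=128,
    is_decoder=True, add_cross_attention=True,
))

# Glue them together; mismatched hidden sizes get a learned projection.
model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = 101  # assumption: the decoder's BOS id
model.config.pad_token_id = 0

# One forward pass on (image, sequence) to check the plumbing.
pixel_values = torch.randn(1, 3, 64, 64)
labels = torch.randint(0, 1000, (1, 8))
out = model(pixel_values=pixel_values, labels=labels)
print(out.logits.shape)  # (1, 8, 1000)
```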
Thank you for the response. I tried training a new tokenizer, and after training Donut with it, I observed that the output has the right syntax and uses the right tokens, but completely ignores the image. I hoped that with more training the optimizer would be “forced” to start using the encoder, but it has not improved (it has converged to outputting the same JSON for every image). I’m guessing that’s the wrong way to think about it. I don’t fully understand how the embeddings from the encoder and the decoder input ids are supposed to interact, or what makes them compatible or not - can you perhaps explain how this works?
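For reference, my rough mental model (please correct me if it’s wrong) is that the decoder consumes the encoder’s output through cross-attention layers, so the image can only influence the output if those layers exist and are trained. A toy check with a randomly initialised decoder, no real checkpoints, all sizes arbitrary:

```python
import torch
from transformers import BertConfig, BertLMHeadModel

torch.manual_seed(0)

# Toy text decoder with cross-attention (stands in for Donut's decoder).
decoder = BertLMHeadModel(BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
    is_decoder=True, add_cross_attention=True,
))
decoder.eval()  # disable dropout so the comparison is deterministic

input_ids = torch.tensor([[1, 2, 3]])
image_a = torch.randn(1, 10, 32)  # stands in for one image's encoder output
image_b = torch.randn(1, 10, 32)  # a different "image"

with torch.no_grad():
    logits_a = decoder(input_ids, encoder_hidden_states=image_a).logits
    logits_b = decoder(input_ids, encoder_hidden_states=image_b).logits

# If the decoder were ignoring the image, these would be identical.
print(torch.allclose(logits_a, logits_b))  # False
```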
Anyway, I will try your suggestion and hope for better results. Thanks again!