Creating a custom Donut model

I have a document understanding task and have created a dataset containing the images and their ground truth. The documents are in Bulgarian. I have tested the Donut model, but I don't have enough images to fine-tune it to understand Bulgarian.
After 15 epochs on a little over 600 images (not to mention the training time of more than 10 hours), I got the "impressive" mean accuracy of 0.01549120881489085.

I have read about VisionEncoderDecoderModel, and my impression is that I can use it to create a custom version of Donut, e.g. using a Swin model for the encoder and BERT (instead of BART) for the decoder. The appeal is that BERT has checkpoints pretrained in my language. Is my understanding correct? I watched a tutorial on working with VisionEncoderDecoderModel but would appreciate more insights. Specifically, I am not sure how to deal with the processor: which processor to use, and whether I need to take any other steps before initializing the model.
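For reference, here is a rough sketch of what I have in mind, based on the VisionEncoderDecoderModel docs. The checkpoint names are just examples (I would swap the multilingual BERT for a Bulgarian one), and I am not sure this is the right way to handle the processor side, so corrections are welcome:

```python
from transformers import (
    VisionEncoderDecoderModel,
    AutoImageProcessor,
    AutoTokenizer,
)

# Example checkpoints only; I would replace the multilingual BERT
# with a BERT checkpoint pretrained on Bulgarian.
encoder_ckpt = "microsoft/swin-tiny-patch4-window7-224"
decoder_ckpt = "bert-base-multilingual-cased"

# Tie a Swin encoder to a BERT decoder; this adds cross-attention
# layers to the decoder and marks it as a decoder automatically.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt
)

# Instead of a single combined processor, keep the encoder's image
# processor and the decoder's tokenizer separately.
image_processor = AutoImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# The wrapper needs to know which tokens start and pad generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

My understanding is that at training time I would then feed `pixel_values` from `image_processor` and `labels` from `tokenizer`, but I am unsure whether I should instead wrap the two into a `DonutProcessor`-style combined processor.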