Donut base-sized model, pre-trained only for a new language tutorial

Hello,

I’m trying to pre-train the Donut model from scratch on Romanian-language documents.

I have around 100k scanned documents, along with a metadata.jsonl formatted like the one SynthDoG generates:

{"file_name": "img_1.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"Text in Romanian language\"}}"}
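
Note that ground_truth is itself a JSON string nested inside the outer JSON object, which is why the inner quotes are escaped. A minimal sketch of writing and reading one such line, assuming only the field names shown above (the helper names make_record and read_text_sequence are hypothetical, not part of Donut or SynthDoG):

```python
import json
from typing import Tuple


def make_record(file_name: str, text: str) -> str:
    """Build one metadata.jsonl line; ground_truth is a JSON string inside JSON."""
    # Inner object, serialized to a string first (this produces the escaped quotes).
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}}, ensure_ascii=False
    )
    # Outer object; ensure_ascii=False keeps Romanian diacritics readable.
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth}, ensure_ascii=False
    )


def read_text_sequence(line: str) -> Tuple[str, str]:
    """Parse one metadata.jsonl line and return (file_name, text_sequence)."""
    record = json.loads(line)
    # Second decode step: ground_truth is still a string after the first loads().
    ground_truth = json.loads(record["ground_truth"])
    return record["file_name"], ground_truth["gt_parse"]["text_sequence"]


line = make_record("img_1.jpg", "Text în limba română")
print(line)
print(read_text_sequence(line))
```

Running read_text_sequence over every line of metadata.jsonl is a cheap way to catch malformed records (for example a missing escaped quote) before starting a long training run.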

and I want to pre-train a new base model so that I can fine-tune on it.

I’ve searched, but I can’t work out how to do it… Could anyone please share the scripts for creating a new base model for a new language, or write a tutorial on pre-training a new Donut model from scratch?

Thanks

Hello @wyzixg,

You might find this discussion useful: Finetune Donut with new tokenizer - Greek

Thanks,
I’ve tried that solution, but it’s not working. If anyone has succeeded in pre-training Donut from scratch, please post the process / scripts you used. I need this for school…