Hello,
I’m trying to pre-train a new Donut model on Romanian-language documents.
I have around 100k scanned documents, along with a metadata.jsonl formatted like the one synthdog generates:
{"file_name": "img_1.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"Text in Romanian language\"}}"}
and I want to create a new pre-trained model that I can then fine-tune.
I’ve searched but can’t work out how to do it… Can anyone please share the scripts for creating a new base model for a new language, or write a tutorial on how to pre-train a new Donut model from scratch?
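For reference, here is a minimal sketch of how a metadata.jsonl line in the format above can be written and read back in Python, in case the nested escaping matters. The helper names make_metadata_line and read_text_sequence are hypothetical, not part of Donut or synthdog; the only assumption is that ground_truth is a JSON string stored inside the outer JSON object, as in the example line.

```python
import json


def make_metadata_line(file_name: str, text: str) -> str:
    """Build one metadata.jsonl line (hypothetical helper, not a Donut API)."""
    # ground_truth is itself a JSON document, serialized to a string first ...
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}}, ensure_ascii=False
    )
    # ... and then embedded as a plain string value in the outer record,
    # which is what produces the \" escapes in the file.
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,  # keep Romanian diacritics readable in the file
    )


def read_text_sequence(line: str) -> str:
    """Decode one line back to its text (two json.loads calls, one per layer)."""
    record = json.loads(line)
    return json.loads(record["ground_truth"])["gt_parse"]["text_sequence"]


line = make_metadata_line("img_1.jpg", "Text in Romanian language")
print(line)
print(read_text_sequence(line))
```

A line that fails the second json.loads call usually has a missing \" somewhere in ground_truth, so it is worth checking the file this way before starting a long training run.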
Thanks