Donut base-sized model, pre-trained only for a new language tutorial

wyzixg · February 14, 2023, 3:28pm

Hello,

I’m trying to pre-train a new Donut model on Romanian-language documents.

I have around 100k scanned documents, including a metadata.jsonl formatted like the one that synthdog generates:

{"file_name": "img_1.jpg", "ground_truth": "{\"gt_parse\": {\"text_sequence\": \"Text in Romanian language\"}}"}
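
For reference, a minimal sketch of how one line in this format could be written and read back, using only the Python standard library. The helper names make_metadata_line and read_text_sequence are hypothetical (they are not part of synthdog or Donut), and the sketch assumes that ground_truth is itself a JSON string nested inside the outer JSON object, as in the line above:

```python
import json


def make_metadata_line(file_name: str, text: str) -> str:
    """Build one metadata.jsonl line (hypothetical helper, not a synthdog API)."""
    # The inner object is serialized first, so "ground_truth" holds a JSON *string*.
    # ensure_ascii=False keeps Romanian diacritics readable instead of \u escapes.
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}}, ensure_ascii=False
    )
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth}, ensure_ascii=False
    )


def read_text_sequence(line: str) -> str:
    """Recover the text from one metadata.jsonl line (hypothetical helper)."""
    record = json.loads(line)
    # Second decode, because ground_truth is a string containing JSON.
    return json.loads(record["ground_truth"])["gt_parse"]["text_sequence"]


if __name__ == "__main__":
    line = make_metadata_line("img_1.jpg", "Text in Romanian language")
    print(line)
    print(read_text_sequence(line))
```

Running read_text_sequence over every line of metadata.jsonl is a quick way to catch escaping mistakes before starting a long training run.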

I want to pre-train a new base model on this data so that I can fine-tune on top of it.

I’ve searched and I can’t work out how to do it… Can anyone please share the scripts for creating a base model for a new language, or write a tutorial on how to pre-train a Donut model from scratch?

Thanks

Inesence · February 17, 2023, 10:59am

Hello @wyzixg,

You might find this discussion useful: Finetune Donut with new tokenizer - Greek

wyzixg · February 19, 2023, 8:45am

Thanks,
I’ve tried that solution, but it’s not working. If anyone who has succeeded in pre-training Donut from scratch can post the process or scripts they used, please help. I need this for school…
