Fine-tuning TrOCR for digit recognition in another language

I’m a beginner, so I’m sorry in advance if I’m making any wrong assumptions.

Synthetic Dataset Generation

  • If I were building my own model, I would have full control over the dimensions of the images I use for training. But since I’m fine-tuning a pre-trained model, I think there might be constraints on image dimensions and other aspects. If so, what restrictions/constraints do I need to be aware of before generating my synthetic dataset?
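To make the question concrete: my current understanding (which may be wrong) is that TrOCR’s ViT encoder expects fixed-size RGB inputs, around 384×384, and that the processor resizes images anyway. Here is a sketch of how I would normalize my synthetic digit images up front so I control the padding myself — the 384 value and the pad-to-square approach are my assumptions, not something I’ve confirmed:

```python
from PIL import Image, ImageOps

# Assumption: the TrOCR image processor wants 384x384 RGB inputs.
TARGET_SIZE = (384, 384)

def prepare_digit_image(img: Image.Image) -> Image.Image:
    """Pad a small synthetic digit image to square, then resize,
    so the glyph isn't stretched out of shape."""
    img = img.convert("RGB")
    side = max(img.size)
    # Pad with white (my synthetic backgrounds are white) to a square.
    padded = ImageOps.pad(img, (side, side), color="white")
    return padded.resize(TARGET_SIZE)

# Example: a fake 28x28 "digit" image, the size used in Afro-MNIST.
digit = Image.new("L", (28, 28), color=255)
print(prepare_digit_image(digit).size)  # (384, 384)
```

If the processor handles all of this internally, then maybe none of it matters and I only need to worry about aspect ratio.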

TrOCR Model

  • Are there TrOCR model components (encoder, decoder, …) or other components like the tokenizer that I would have to replace in order to work on my specific problem of Ge’ez digit recognition?
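For context, these are the glyphs I mean: the Ge’ez numerals ፩–፱ at Unicode codepoints U+1369–U+1371 (Ge’ez also has symbols for 10, 100, etc., which I’m leaving out here). Below is a toy sketch of the coverage check I have in mind — whether a vocabulary already contains these characters, or whether I’d need something like `tokenizer.add_tokens` plus resizing the decoder embeddings (those names are from the transformers API, but I haven’t verified that exact workflow):

```python
# Ge'ez numerals 1-9 occupy Unicode codepoints U+1369..U+1371.
GEEZ_DIGITS = [chr(0x1368 + i) for i in range(1, 10)]

def missing_glyphs(vocab: set[str], glyphs: list[str]) -> list[str]:
    """Toy stand-in for a tokenizer coverage check:
    which target glyphs are absent from a character vocabulary?"""
    return [g for g in glyphs if g not in vocab]

# A hypothetical English-centric character vocab lacks all of them:
english_vocab = set("0123456789abcdefghijklmnopqrstuvwxyz")
print(len(missing_glyphs(english_vocab, GEEZ_DIGITS)))  # 9
```

A real tokenizer works on subword tokens rather than a flat character set, so the actual check would be different — this is just to show what I mean by "coverage".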
  • The pretrained model (one of the ones trained on English text) presumably has some notion of what each digit looks like. Won’t that create confusion and cause it to perform terribly when I try to fine-tune it on glyphs from another script?
  • All the TrOCR examples I’ve seen (specifically the ones @nielsr created) are done on single-line text images. But there’s no reason it shouldn’t work on single-digit images too, right? (e.g. the Afro-MNIST dataset)
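In other words, I’m picturing training examples that look just like the IAM ones in those notebooks — an image path plus a ground-truth string — except each string is a single glyph. A sketch of the manifest I’d build (the file names and the `make_split` helper are made up for illustration):

```python
import random

GEEZ_DIGITS = [chr(0x1368 + i) for i in range(1, 10)]  # U+1369..U+1371

def make_split(n: int, seed: int = 0) -> list[dict]:
    """Hypothetical dataset manifest: one synthetic image per record,
    labelled with a single Ge'ez digit as its 'text'."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        records.append({"file_name": f"synthetic/{i:05d}.png",
                        "text": rng.choice(GEEZ_DIGITS)})
    return records

train = make_split(8)
print(train[0]["file_name"])  # synthetic/00000.png
```

If TrOCR is fine with one-character targets, I assume the rest of the fine-tuning loop from the single-line notebooks carries over unchanged.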

Computational Time & Resources

  • I was going through one of the tutorial notebooks @nielsr created (thank you so much for those, btw) on Google Colab. Using the T4 GPU, fine-tuning the TrOCR model on the IAM test set takes a long time; it got to the point where Colab told me I had run out of my usage limits. Especially given that the dataset is relatively small (about 3K images), I’m worried whether I have the resources to fine-tune the model on a relatively bigger dataset (e.g. Afro-MNIST). Are there any strategies I should explore to use the resources Google Colab gives me efficiently? Or are there any other free options available?
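The levers I’ve read about so far are smaller per-device batches combined with gradient accumulation (and fp16, which I understand roughly halves activation memory). Here’s my back-of-envelope sketch of the arithmetic — these are my assumptions about how accumulation works, not measured numbers:

```python
def effective_batch_size(per_device: int, accum_steps: int,
                         n_gpus: int = 1) -> int:
    """Gradient accumulation: run `accum_steps` small forward/backward
    passes before each optimizer step, so the optimizer effectively
    sees per_device * accum_steps * n_gpus examples per update."""
    return per_device * accum_steps * n_gpus

# e.g. batch 4 on the GPU, accumulate 8 steps -> optimizer-step batch 32
print(effective_batch_size(4, 8))  # 32

def steps_per_epoch(dataset_size: int, eff_batch: int) -> int:
    # Ceil division: leftover examples still form one (smaller) step.
    return -(-dataset_size // eff_batch)

# An IAM-sized dataset (~3K) at effective batch 32:
print(steps_per_epoch(3000, 32))  # 94
```

My understanding is that in the transformers Trainer these map to the `per_device_train_batch_size`, `gradient_accumulation_steps`, and `fp16` arguments, but I’d appreciate confirmation that that’s the right set of knobs for a T4.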

Thank you.