Fine-tuning TrOCR on a new language

Hi everyone,

I’m currently fine-tuning TrOCR on ancient handwritten texts in Spanish, using “microsoft/trocr-large-handwritten” as both processor and model, and the results have been outstanding. However, I’ve been wondering whether a language-specific language model could improve them further. In some discussions, such as huggingface/transformers issue #15823 on GitHub (“Fine tune TrOCR using bert-base-multilingual-cased”), the recommended setup for training with a different language model looks like this:

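from transformers import AutoTokenizer, TrOCRProcessor, ViTImageProcessor, VisionEncoderDecoderModel

# Image preprocessing from ViT, text tokenization from XLM-RoBERTa
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")
decoder_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Current transformers releases name this argument image_processor (feature_extractor is deprecated)
processor = TrOCRProcessor(image_processor=image_processor, tokenizer=decoder_tokenizer)

# Pretrained ViT encoder + pretrained XLM-RoBERTa decoder; the cross-attention between them is newly initialized
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-384", "xlm-roberta-base")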
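
As far as I understand, a model built this way also needs its special-token IDs set before training, because VisionEncoderDecoderModel derives the decoder inputs from the labels using pad_token_id and decoder_start_token_id. Below is a minimal sketch of that step. The helper name configure_special_tokens is my own (hypothetical), and using CLS as the start token and SEP as end-of-sequence is an assumption that fits BERT/RoBERTa-style tokenizers. The stand-in objects only exist so the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def configure_special_tokens(model, tokenizer):
    """Copy the tokenizer's special-token IDs into the encoder-decoder config."""
    cfg = model.config
    # Needed for training: labels are shifted right using these two IDs.
    cfg.decoder_start_token_id = tokenizer.cls_token_id
    cfg.pad_token_id = tokenizer.pad_token_id
    # Used at generation time to decide when to stop
    # (assumption: the SEP token doubles as end-of-sequence).
    cfg.eos_token_id = tokenizer.sep_token_id
    # Keep the top-level vocab size in sync with the decoder's.
    cfg.vocab_size = cfg.decoder.vocab_size
    return model


# Stand-ins for the real objects; the ID and vocab values are placeholders.
fake_tokenizer = SimpleNamespace(cls_token_id=0, pad_token_id=1, sep_token_id=2)
fake_model = SimpleNamespace(
    config=SimpleNamespace(decoder=SimpleNamespace(vocab_size=1000))
)
configure_special_tokens(fake_model, fake_tokenizer)
print(fake_model.config.decoder_start_token_id, fake_model.config.pad_token_id)
```

With the real objects from the snippet above, the call would be configure_special_tokens(model, processor.tokenizer).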

However, this looks more like training “from scratch” than fine-tuning: the encoder and decoder are each pretrained, but neither was trained for OCR, and the cross-attention layers connecting them start out randomly initialized. What if we want to combine the pre-trained weights of the existing, powerful TrOCR handwriting model (“microsoft/trocr-large-handwritten”) with a larger language model, such as “xlm-roberta-large-finetuned-conll02-spanish” or any other Spanish RoBERTa model? Would this be a viable strategy? If so, what might be an effective combination? Maybe something like this:

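from transformers import AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel

# Keep the TrOCR image preprocessing and model weights, swap only the tokenizer
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
processor.tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")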
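
One thing I’m unsure about with this second variant: the decoder’s embedding table was trained with TrOCR’s original tokenizer, so swapping only the tokenizer would likely produce token IDs that no longer line up with it. Here is a minimal sanity check for that, written as a sketch. The function name vocab_mismatch is hypothetical, and the stand-in objects are placeholders so it runs without downloading any weights.

```python
from types import SimpleNamespace


def vocab_mismatch(model, tokenizer):
    """Return (tokenizer_size, decoder_vocab_size) if they differ, else None."""
    tokenizer_size = len(tokenizer)  # Hugging Face tokenizers support len()
    decoder_size = model.config.decoder.vocab_size
    if tokenizer_size != decoder_size:
        return tokenizer_size, decoder_size
    return None


# Stand-ins: a 4-token "tokenizer" and a decoder that expects 3 tokens.
fake_tokenizer = ["<s>", "<pad>", "</s>", "hola"]
fake_model = SimpleNamespace(
    config=SimpleNamespace(decoder=SimpleNamespace(vocab_size=3))
)
print(vocab_mismatch(fake_model, fake_tokenizer))
```

With the real objects, the call would be vocab_mismatch(model, processor.tokenizer). Matching sizes alone would not prove that the IDs mean the same thing in both vocabularies. If the sizes differ, my understanding is that model.decoder.resize_token_embeddings(len(processor.tokenizer)) would make the shapes fit, but the embeddings would then no longer correspond to what the decoder learned, which again seems closer to retraining than to fine-tuning.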

Thank you in advance for your help and insights.