Getting poor word accuracy after fine-tuning TrOCR on Bangla

I have fine-tuned TrOCR for the Bangla language, using 1.8M word images for training and 0.4M word images for validation. The encoder is microsoft/beit-base-patch16-384 and the decoder is xlm-roberta-base, with training set to 11 epochs. The metrics logged for the best saved checkpoint are: loss: 0.8127, learning_rate: 4.6280169780882425e-05, epoch: 0.86, step: 20000, eval_loss: 7.68533182144165, eval_cer: 2.3218661516832686, eval_runtime: 14836.3588, eval_samples_per_second: 31.419, eval_steps_per_second: 0.393, epoch: 1.0, step: 23308. But when I test the saved model on 20k seen word images, it gives a word accuracy of only 40.6%. I followed https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb. Can you please tell me what the possible reasons are for such poor accuracy on seen data when eval_cer looks good, and how I can improve the accuracy of this fine-tuned model? Please help!
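
For reference, this is roughly how I compare the two metrics at test time (a minimal sketch; the strings below are placeholder examples, and I assume the `evaluate` library's CER metric, which is backed by jiwer):

```python
# Minimal sketch: CER vs. word accuracy on the same decoded test outputs.
# `predictions` / `references` stand in for the real strings produced by
# model.generate() + processor.batch_decode() on the test set.
import evaluate

cer_metric = evaluate.load("cer")

predictions = ["আমার", "নাম", "বাংলা"]    # hypothetical model outputs
references  = ["আমার", "নামে", "বাংলা"]   # hypothetical ground truth

# CER counts character-level edits, so a single wrong character in a word
# only adds a small fraction to the overall error rate.
cer = cer_metric.compute(predictions=predictions, references=references)

# Word accuracy is an exact string match, so the same single-character
# mistake makes the whole word count as wrong.
word_acc = sum(p == r for p, r in zip(predictions, references)) / len(references)

print(f"CER: {cer:.4f}, word accuracy: {word_acc:.2%}")
```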
@sgugger @nielsr @ydshieh @pierreguillou @IdoAmit198
Thanks in advance!

Actually, what I found while experimenting with TrOCR is that the accuracy largely depends on which encoder and decoder you're using. My aim was a multilingual OCR that included Bengali as well, and I used a synthetically generated dataset of 4M sentences. Can you please mention which models you are using as the encoder and decoder?

Sorry, I skipped the part where you mentioned the models. BEiT should work fine as the encoder, but try switching out XLM-RoBERTa as the decoder for something else. If you go through the architecture of the decoder in the TrOCR base checkpoint, you'll find that it's quite different from RoBERTa and instead quite similar to BART, so I would suggest going with some Bengali version of BART.
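
You can check this yourself by loading the stage-1 checkpoint and printing its decoder; a rough sketch:

```python
# Rough sketch: inspect which decoder the TrOCR stage-1 checkpoint actually uses.
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

# The decoder is a TrOCRForCausalLM; its layout (a standard transformer decoder
# with cross-attention in every block) is much closer to BART's decoder than
# to a plain RoBERTa encoder stack.
print(type(model.decoder).__name__)   # TrOCRForCausalLM
print(model.decoder.config)
```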

Thanks for your reply @AnustupOCR … but there is no pre-trained BART model for the Bangla language!

I went through the paper and found that microsoft/trocr-base-stage1 was trained with a BEiT encoder and a RoBERTa decoder. So I used xlm-roberta-base (pre-trained on filtered CommonCrawl data covering 100 languages) as the decoder.
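
Concretely, I set the model up roughly as in the notebook, just with these two checkpoints (a minimal sketch; the special-token settings are simplified):

```python
# Minimal sketch of combining the two checkpoints, following the
# Seq2SeqTrainer notebook but with BEiT + XLM-RoBERTa swapped in.
from transformers import VisionEncoderDecoderModel, AutoTokenizer

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/beit-base-patch16-384",   # vision encoder
    "xlm-roberta-base",                  # text decoder
)
# Note: XLM-RoBERTa has no cross-attention of its own, so those layers are
# added here with random initialization and only get learned during fine-tuning.

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Special-token and generation settings, as in the tutorial
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size
```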