Please Help! How to properly label RTL ground truth data for fine-tuning/training ViT models

Hello everyone,

I am a beginner working to fine-tune/train my first TrOCR model.

It is not clear to me how to properly label ground truth training data for RTL languages.

Supposing I have the following image in Hebrew:


Should my ground truth LABEL be:





In @nielsr’s Seq2Seq example, this would be the series of labels under the "text" column of the pandas DataFrame shown in Out[3].

I am using the following for my fine-tuning:

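from transformers import VisionEncoderDecoderModel

# DeiT image encoder paired with a Hebrew language model as the text decoder
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/deit-base-distilled-patch16-384", "HeNLP/HeRo"
)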

Could someone explain how encoder/decoder transformers work with RTL scripts? cc @NielsR, @Norod78: is there a setting I can use to tell a model that the labels are RTL?

Thank you,

Hi there. I was working on a project similar to yours (Hebrew TrOCR), and in my experience there is no special setting for RTL; I just used the seq2seq script. So far I’ve managed to overfit on a few lines, but I’m stuck without a good-quality dataset.

If you are interested, I can share my project on GitHub.

Thanks @sivan22. I take it, then, that you are NOT reversing the ground truth data (i.e. “שלום” becoming “םולש”)? How does your model know how to compare ground truth against the image? Does it just read/encode/decode everything LTR, text or image? And what are you getting for CER?

There is a list of Hebrew datasets in the NNLP-IL/Hebrew-Resources repository on GitHub ("A comprehensive list of Hebrew NLP resources").

Hi @jhhf. As I understand it, the label is just a sequence of tokens, one after another, that the model produces from processing the image. You can read those tokens from right to left if you wish; the model is simply trained to understand RTL writing.
And you must do it this way (ordinary, non-reversed text) so that the labels match heBERT, or whichever other decoder you use, since it was pretrained on RTL Hebrew.
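
To make the point about token order concrete, here is a minimal sketch using only the Python standard library. It assumes the label was typed normally, so the string is stored in logical (reading) order; the variable names are just for illustration. The right-to-left appearance is a display-time effect, and reversing the string would give the decoder character sequences it never saw in pretraining.

```python
import unicodedata

# "shalom" written with escapes so the stored order is unambiguous:
# SHIN, LAMED, VAV, FINAL MEM (the order a Hebrew reader reads them in)
label = "\u05e9\u05dc\u05d5\u05dd"

# The first stored character is the first letter read, even though it is
# drawn at the right-hand edge of the word on screen
first_char_name = unicodedata.name(label[0])
last_char_name = unicodedata.name(label[-1])

# Each Hebrew letter carries the Unicode bidi class "R"; renderers use this
# to lay the text out right to left without changing the stored order
bidi_classes = [unicodedata.bidirectional(ch) for ch in label]

# Reversing the string gives the "visual order" variant asked about above;
# this is NOT what a decoder pretrained on ordinary Hebrew text expects
visual_order = label[::-1]

print(first_char_name, "...", last_char_name)
print(bidi_classes)
print(visual_order == label)
```

So the ground truth can stay exactly as it would be typed in any text editor; no per-label flag or reversal step is involved in this sketch.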

As I mentioned, I’ve overfitted a very small dataset (5 items) to 0.05 CER, just to test that everything works. Then I tried a bigger dataset (30k synthetic lines generated with TRDG) and got poor results (I don’t remember exactly, but I think about 0.7).
I then read that Microsoft trained their TrOCR on a dataset of 300M lines (!).
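
For readers unsure what the CER figures in this thread measure, here is a minimal pure-Python sketch: total character-level edit distance divided by total reference length. This is an illustrative implementation, not necessarily the metric code used for the numbers quoted here, and the function names `edit_distance` and `cer` are my own.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Character error rate over a batch of reference/prediction pairs."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# One wrong character out of four
print(cer(["abcd"], ["abxd"]))
```

Because the comparison is character by character in stored order, a prediction in logical order scored against a reversed label (or the other way round) is penalized on almost every character, so it is worth checking that labels and predictions use the same order before reading much into a high CER.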

What I was actually missing is a Hebrew handwriting dataset.

Appreciate the help @sivan22. I also used Hebrew synthetic text from TRDG, and my CER was terrible. I ran a few epochs over several sessions of 100 to 700 images, and the best CER in any run was only 0.83. I stuck to serif/sans-serif characters and fairly regular non-nikud fonts to try to minimize problems. I tried both “avichr/heBERT” and “HeNLP/HeRo” (both on Hugging Face) as my tokenizer/decoder, with “facebook/deit-base-distilled-patch16-384” as my encoder.

Mostly RTL with a run of LTR output for experimentation.

Any tips would be appreciated.

I’d like to do handwriting, but don’t feel I can move forward until I solve the poor serif/sans-serif CER.

I think that we should try a much bigger dataset and see if it helps.

On the other hand, it might be the LM’s problem, since as far as I know the Hebrew models are not on par with the English ones. In that case we need to wait for the release of a better model.

The good news is that the Israeli government is pushing hard to create a comparable model in Hebrew.

For example, the TrOCR paper describes using a large number of PDF files, because the PDF format contains both the text and its rendered image. But I don’t have a tool to do this extraction.

good luck.

Hi @jhhf, I’ve just looked at the W&B report of my last run and found it interesting. Here it is:

Every jump in the loss marks an epoch, so it looks like an overfitting problem. The val CER is a little lower than the one you reported.

Thanks again for your insights @sivan22. I have been trying different tokenizers/decoders as well as different Seq2SeqTrainer settings, and will report back to you when I get some numbers. I did get a 0.63 CER, so I am moving in the right direction!

I’m sorry, this is the right image:

@jhhf try using this fresh SOTA BERT: dicta-il/dictabert (on Hugging Face).

Thanks @sivan22. I have managed to get my CER down to 0.28 through increased training and LR adjustment, and will do a parallel test on the dictabert model you suggest. Appreciate that you continue to provide some solutions.