Please Help! How to properly label RTL ground truth data for fine-tuning/training ViT models

jhhf · June 29, 2023, 6:28pm

Hello everyone,

I am a beginner working to fine-tune/train my first TrOCR model.

It is not clear to me how to properly label ground truth training data for RTL languages.

Supposing I have the following image in Hebrew:

shalom

Should my ground truth LABEL be:

שלום

or

םולש

?

In @nielsr's Seq2Seq example at https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb, these would be the labels in the "text" column of the pandas DataFrame shown in Out[3].

I am using the following for my fine-tuning:

from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/deit-base-distilled-patch16-384", "HeNLP/HeRo"
)

Would someone guide me on how encoder/decoder transformers work with RTL scripts? cc @NielsR, @Norod78: is there a setting I can use to tell a model that the labels are RTL?

Thank you,

sivan22 · July 26, 2023, 5:14am

Hi there. I was working on a project similar to yours (Hebrew TrOCR), and in my experience there is no special setting for RTL; I just used the seq2seq script. So far I’ve managed to overfit on a few lines, but I’m stuck without a good-quality dataset.
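
For anyone following along, here is a minimal sketch of the settings the seq2seq notebook copies from the tokenizer after loading the model. None of them concerns text direction. The helper name `configure_special_tokens` is made up for illustration, and the sketch assumes the decoder tokenizer defines `cls_token_id`, `sep_token_id` and `pad_token_id` (BERT- and RoBERTa-style tokenizers generally do):

```python
def configure_special_tokens(model, tokenizer):
    """Copy special-token ids from the decoder tokenizer into the model config.

    Hypothetical helper: `model` is expected to be a VisionEncoderDecoderModel
    and `tokenizer` the tokenizer of its decoder checkpoint.
    """
    # Token the decoder is fed at step 0 of generation.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    # Token used to pad label sequences in a batch.
    model.config.pad_token_id = tokenizer.pad_token_id
    # Token that marks the end of a generated sequence.
    model.config.eos_token_id = tokenizer.sep_token_id
    # Keep the top-level vocab size in sync with the decoder.
    model.config.vocab_size = model.config.decoder.vocab_size
    return model


# Usage (not run here, since it downloads the checkpoints):
# from transformers import AutoTokenizer, VisionEncoderDecoderModel
# tokenizer = AutoTokenizer.from_pretrained("HeNLP/HeRo")
# model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
#     "facebook/deit-base-distilled-patch16-384", "HeNLP/HeRo"
# )
# configure_special_tokens(model, tokenizer)
```

All of these are token ids, so the same setup applies whether the script is LTR or RTL.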

If you are interested, I can share my project on GitHub.

jhhf · July 28, 2023, 9:23pm

Thanks @sivan22. I take it, then, that you are NOT reversing the ground truth data (i.e. “שלום” becoming “םולש”)? How does your model know how to compare ground truth against the image? Does it just read/encode/decode everything LTR, text or image? And what are you getting for CER?

There is a list of Hebrew datasets in the NNLP-IL/Hebrew-Resources repository on GitHub, a comprehensive list of Hebrew NLP resources.

sivan22 · July 30, 2023, 7:38pm

Hi @jhhf. As I understand it, the output is just a sequence of tokens, one after another, produced as the model processes the image. So you can keep the labels in their natural right-to-left reading order, and the model will simply be trained to read RTL writing.
And you must do it this way, of course, to match the heBERT (or any other) model, which was pretrained on normally ordered RTL Hebrew text.
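
To make the "natural order" point concrete, here is a small sketch using only the Python standard library. It shows that a normally typed Hebrew string is stored first letter first (logical order), whatever direction it is displayed in. The `looks_visually_reversed` check is a rough heuristic of my own, not part of any library: it assumes a final-form letter at the start of a word means the string was reversed.

```python
import unicodedata

# "shalom" typed normally, written with escapes so the stored order is explicit:
# shin, lamed, vav, final mem.
label = "\u05e9\u05dc\u05d5\u05dd"

print(unicodedata.name(label[0]))           # HEBREW LETTER SHIN
print(unicodedata.name(label[-1]))          # HEBREW LETTER FINAL MEM
print(unicodedata.bidirectional(label[0]))  # R

# Final forms: final kaf, final mem, final nun, final pe, final tsadi.
FINAL_FORMS = set("\u05da\u05dd\u05df\u05e3\u05e5")


def looks_visually_reversed(text: str) -> bool:
    """Heuristic: flag text where some word starts with a final-form letter."""
    return any(word[0] in FINAL_FORMS for word in text.split())


print(looks_visually_reversed(label))        # False
print(looks_visually_reversed(label[::-1]))  # True
```

A check like this can be run over a label file to catch ground truth that was accidentally saved in display order. It only fires on words that contain a final-form letter, so it will miss many reversed strings.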

As I mentioned, I overfitted a very small dataset (5 items) to 0.05 CER, just to test that everything works. Then I tried a bigger dataset (30k synthetic samples generated with TRDG) and got poor results (I don’t remember exactly, but I think about 0.7).
I then read that Microsoft trained their TrOCR on a dataset of 300M lines (!).

Actually, what I was missing was a Hebrew handwriting dataset.
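
For reference, CER here means character error rate: the character-level edit distance divided by the number of characters in the ground truth. Below is a minimal pure-Python sketch, assuming the usual corpus-level definition (total edits over total reference characters). The function names are my own, and this is not the exact metric code used in the runs above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(references, predictions) -> float:
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


label = "\u05e9\u05dc\u05d5\u05dd"  # "shalom" in logical order
print(cer([label], [label]))        # 0.0
print(cer([label], [label[::-1]]))  # 1.0
```

Note that a prediction with every character right but in reversed order scores 1.0 on this word, so CER is only meaningful when predictions and labels use the same character order.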

jhhf · August 1, 2023, 12:29am

Appreciate the help @sivan22. I also used Hebrew synth text from TRDG, and my CER was terrible. I ran a few epochs over several sessions of 100 to 700 images, and the best CER in any run was only 0.83. I stuck to serif/sans-serif characters and fairly regular non-nikud fonts, to try to minimize problems. I tried both “avichr/heBERT” and “HeNLP/HeRo” (both on Hugging Face) as my tokenizer/decoder, with “facebook/deit-base-distilled-patch16-384” as my encoder.

Mostly RTL with a run of LTR output for experimentation.

Any tips would be appreciated.

I’d like to do handwriting, but don’t feel I can move forward until I solve the poor serif/sans-serif CER.

sivan22 · August 2, 2023, 4:03am

I think that we should try a much bigger dataset and see if it helps.

On the other hand, it might be the LM’s problem, as I know that the Hebrew models cannot be compared to the English ones. In that case we need to wait for the release of a better model.

The good news is that the Israeli government is pushing hard to create a comparable model in Hebrew.

As for data, the TrOCR paper describes using a lot of PDF files, because the PDF format contains both the text and its rendered image. But I don’t have a tool to do this extraction.

good luck.

sivan22 · August 7, 2023, 11:42am

Hi @jhhf, I’ve just looked at the W&B report of my last run and found it interesting. Here it is:

Every jump in the loss is an epoch boundary, so it looks like an overfitting problem. The val CER is a little lower than the one you reported.

jhhf · August 14, 2023, 4:29pm

Thanks again for your insights @sivan22. I have been trying different tokenizer/decoders as well as different Seq2SeqTrainer settings and will report back to you when I get some numbers. I did get a 0.63 CER, so I am moving in the right direction!

sivan22 · August 14, 2023, 5:22pm

I’m sorry, this is the right image

sivan22 · September 2, 2023, 8:02pm

@jhhf try using this fresh SOTA BERT: dicta-il/dictabert (on Hugging Face).

jhhf · September 13, 2023, 3:31pm

Thanks @sivan22. I have managed to get my CER down to 0.28 through increased training and LR adjustment, and will do a parallel test on the dictabert model you suggest. Appreciate that you continue to provide some solutions.
