Training a language model on Arabic data - handling right-to-left text direction

I am currently working on training a language model on Arabic data and I have encountered an issue with handling the right-to-left text direction. Specifically, I am unsure whether I should reverse the text in my training data to accommodate the right-to-left direction, or whether doing so would hurt the quality of the resulting language model.

As Arabic is a right-to-left language, words are displayed from right to left, the opposite of left-to-right languages like English. Therefore, I am concerned that reversing the text could change the meaning of the sentences and produce incorrect results. On the other hand, if I do not reverse the text, the language model may not learn the correct relationships between the words and may not perform well on real-world data.

I would appreciate any advice or insights on how to handle right-to-left text direction when training a language model on Arabic data. Specifically, I would like to know:

  • Should I reverse the text in my training data or not?
  • If I do not reverse the text, how can I ensure that the language model is able to understand the correct relationships between words and produce accurate results?
  • Are there any other considerations or best practices I should be aware of when training a language model on Arabic data with right-to-left text direction?
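One point that usually resolves this: Unicode stores Arabic (and Hebrew) text in logical order, i.e. reading order, so the first code point of a string is the first letter a reader sees, regardless of how it is displayed on screen. A minimal check in Python (the sample word is just an illustration):

```python
# Unicode stores RTL scripts in logical (reading) order, not visual order.
# The first code point of an Arabic string is the letter read first
# (the rightmost one on screen), so the training text needs no reversal.
text = "سلام"  # Arabic for "peace"; an arbitrary sample word
print(text[0])                      # 'س' (seen), the letter read first
print([hex(ord(c)) for c in text])  # code points already in reading order
```

Because the stored order is already the reading order, a standard tokenizer consumes the text first-word-first, and the model learns left-context-to-right-context relationships exactly as it would for English.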

Yes, we need an answer to this very important question! If, during training, we pass an RTL image/text pair to a ViT encoder-decoder (e.g. TrOCR, microsoft/trocr-base-printed on Hugging Face), do we reverse the text description so that the first character on the far right of the image becomes the first character on the far left of the text description? @nielsr - any guidance please?

Example in Hebrew:
Text: שלום (<-- read RTL)
Image: [screenshot of the word rendered in Google Translate]

Does the encoder read the image LTR (in which case we should reverse our text description string)?

Please help those of us working on Arabic, Urdu, Farsi, and Hebrew models!
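One practical consideration when a corpus mixes scripts: the Unicode bidirectional class of each character tells you whether a string is RTL, which is handy for filtering or sanity-checking training data. A rough sketch (the `is_rtl` helper is hypothetical, not a library function):

```python
import unicodedata

def is_rtl(text: str) -> bool:
    # Heuristic: classify by the first strong-directional character.
    # Bidi class 'R' covers Hebrew; 'AL' (Arabic Letter) covers
    # Arabic, Urdu, and Farsi.
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return False
        if bidi in ("R", "AL"):
            return True
    return False  # no strong-directional character found

print(is_rtl("سلام"))         # True  (Arabic)
print(is_rtl("שלום"))         # True  (Hebrew)
print(is_rtl("hello world"))  # False
```

Note this is only a first-strong-character heuristic; the full Unicode Bidirectional Algorithm (UAX #9) handles mixed-direction runs, but for routing whole training samples this level of check is usually enough.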

@abdoelsayed A couple of clarifications needed:

  • Are you building a brand-new model of your own?
  • Or do you just want to train a model on Arabic text (without worrying about how it will handle the RTL problem)? In that case, have you tried the tiiuae/falcon-40b model?

Thanks.

This will help: 9.1. Working with Sequences - Dive into Deep Learning 1.0.3 documentation