Training a language model on Arabic data - handling right-to-left text direction

I am currently working on training a language model on Arabic data and I have encountered an issue with handling the right-to-left text direction. Specifically, I am unsure whether I should reverse the text in my training data to accommodate the right-to-left direction, or whether doing so would hurt the quality of the resulting language model.

As Arabic is a right-to-left language, words are displayed from right to left, the opposite of left-to-right languages like English. Therefore, I am concerned that reversing the text could change the meaning of the sentences and produce incorrect results. On the other hand, if I do not reverse the text, the language model may not learn the correct relationships between the words and may not perform well on real-world data.

I would appreciate any advice or insights on how to handle right-to-left text direction when training a language model on Arabic data. Specifically, I would like to know:

  • Should I reverse the text in my training data or not?
  • If I do not reverse the text, how can I ensure that the language model is able to understand the correct relationships between words and produce accurate results?
  • Are there any other considerations or best practices I should be aware of when training a language model on Arabic data with right-to-left text direction?
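One point that usually resolves this: Unicode stores Arabic (and Hebrew) text in logical order, i.e. reading order, so the first code point of a string is the first letter a reader sees, regardless of how it is displayed on screen. A minimal check in Python (the sample word is just an illustration):

```python
# Unicode stores RTL scripts in logical (reading) order, not visual order.
# The first code point of an Arabic string is the letter read first
# (the rightmost one on screen), so the training text needs no reversal.
text = "سلام"  # Arabic for "peace"; an arbitrary sample word
print(text[0])                      # 'س' (seen), the letter read first
print([hex(ord(c)) for c in text])  # code points already in reading order
```

Because the stored order is already the reading order, a standard tokenizer consumes the text first-word-first, and the model learns left-context-to-right-context relationships exactly as it would for English.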

Yes, we need an answer to this very important question! If, during training, we pass an RTL image/text pair to a ViT encoder-decoder (e.g. TrOCR, microsoft/trocr-base-printed on Hugging Face), do we reverse the text description so that the first character on the far right of the image becomes the first character on the far left of the text description? @nielsr - any guidance please?

Example in Hebrew:
Text: שלום (<-- read RTL)
Image: [screenshot of the word rendered in Google Translate]

Does the encoder read the image LTR (in which case we should reverse our text description string)?

Please help those of us working on Arabic, Urdu, Farsi, and Hebrew models!
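One practical consideration when a corpus mixes scripts: the Unicode bidirectional class of each character tells you whether a string is RTL, which is handy for filtering or sanity-checking training data. A rough sketch (the `is_rtl` helper is hypothetical, not a library function):

```python
import unicodedata

def is_rtl(text: str) -> bool:
    # Heuristic: classify by the first strong-directional character.
    # Bidi class 'R' covers Hebrew; 'AL' (Arabic Letter) covers
    # Arabic, Urdu, and Farsi.
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return False
        if bidi in ("R", "AL"):
            return True
    return False  # no strong-directional character found

print(is_rtl("سلام"))         # True  (Arabic)
print(is_rtl("שלום"))         # True  (Hebrew)
print(is_rtl("hello world"))  # False
```

Note this is only a first-strong-character heuristic; the full Unicode Bidirectional Algorithm (UAX #9) handles mixed-direction runs, but for routing whole training samples this level of check is usually enough.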

@abdoelsayed A couple of clarifications needed:

  • Are you building a brand-new model of your own?
  • Or do you just want to train a model on Arabic text (without worrying about how it will handle the RTL problem)? In that case, have you tried the tiiuae/falcon-40b model?

Thanks.

This will help: 9.1. Working with Sequences - Dive into Deep Learning 1.0.3 documentation