NLP in Arabic with HF and Beyond
Overview
The Arabic language consists of 28 basic letters, in addition to extra forms produced by combining letters with Hamza (ء), such as أ، ؤ، ئ, which mark a glottal stop on the letter. Moreover, there are special marks called diacritics that compensate for the absence of short vowels in the script. Counting diacritics, the number of distinct symbols grows to over 40. In the Arabic writing script, letters within a word are usually connected together (ligatures), which makes many tasks, such as OCR, challenging. From a language-construction perspective, Arabic is morphologically rich: multiple prefixes, suffixes and infixes can be attached to a single stem. Arabic morphology is of two types: derivational and inflectional. Derivational morphology, which is common in Arabic, allows many words to be derived from a single stem. Inflectional morphology, on the other hand, can add multiple morphemes to a single stem, producing much longer and more complex words. As a result, an Arabic verb can have up to 5,400 different forms. This rich morphology gave rise to distinct and complex dialects across the Arab world, and the overall complexity of the language makes it challenging for language modelling in general.
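To make the role of diacritics concrete, here is a minimal sketch in plain Python (no external libraries) that strips the eight common diacritic marks, which occupy the Unicode range U+064B–U+0652, from a vocalized word. The function name and sample word are illustrative:

```python
import re

# The common Arabic diacritics (tashkeel) occupy the Unicode range
# U+064B (FATHATAN) through U+0652 (SUKUN).
DIACRITICS = re.compile(r'[\u064B-\u0652]')

def strip_diacritics(text: str) -> str:
    """Remove short-vowel and related diacritic marks from Arabic text."""
    return DIACRITICS.sub('', text)

vocalized = 'مُحَمَّدٌ'   # a fully vocalized word
print(strip_diacritics(vocalized))  # prints the bare 4-letter form: محمد
```

Removing diacritics like this is a common normalization step in Arabic NLP pipelines, since most text in the wild is written without them.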
State of NLP in Arabic
Before the recent advances in NLP, much of the work on Arabic focused on morphological analysis. Notably, many tools were created for morphological analysis and segmentation, such as FARASA and MADAMIRA. Later, after the success of word embeddings like Word2Vec, AraVec was released as a very large set of word embeddings for Modern Standard Arabic (MSA) and dialects. More recently, with the advances in transformers, many models have been pre-trained for Arabic: the BERT-based AraBERT and its improved variants MARBERT and ARBERT, the GPT-2-based AraGPT2, the ELECTRA-based AraELECTRA, and GigaBERT, which leverages zero-shot cross-lingual transfer from English. There are also task-specific models for dialect identification, fake news detection, sentiment analysis, translation, and more.
Datasets and Models in HF
Currently there are around 50 Arabic datasets covering different tasks such as sentiment analysis, text classification, question answering, translation, unsupervised training, and localization. All the transformer models mentioned in the previous section are available on the Hugging Face model hub. In addition, there are currently over 70 models for Arabic, ranging from monolingual to multilingual, covering both MSA and dialects.
Arabic Transformers in HF
We can use the datasets library to load any available Arabic dataset. In this example we use the MetRec dataset, which contains poems labeled with their meters. The code snippet below loads the dataset:
from datasets import load_dataset
dataset = load_dataset('metrec')
print(dataset)
Printing the variable shows that the dataset contains 47,124 records for training and 8,316 records for testing:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 47124
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 8316
    })
})
Transformers and their tokenizers can also be used with ease. Here is a simple example of loading a pretrained Arabic model and its tokenizer with the transformers library:
from transformers import AutoTokenizer, AutoModel
PRE_TRAINED_MODEL_NAME = 'aubmindlab/bert-base-arabertv01'
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
bert_model = AutoModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
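With the model and tokenizer loaded, we can encode an Arabic sentence and obtain contextual embeddings. A minimal self-contained sketch (the sample sentence is arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModel

PRE_TRAINED_MODEL_NAME = 'aubmindlab/bert-base-arabertv01'
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
bert_model = AutoModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

# Tokenize a sentence and return PyTorch tensors.
inputs = tokenizer('أهلاً وسهلاً', return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))

# Forward pass without gradients; last_hidden_state has shape
# (batch_size, sequence_length, hidden_size).
with torch.no_grad():
    outputs = bert_model(**inputs)
print(outputs.last_hidden_state.shape)
```

The per-token vectors in `last_hidden_state` can then be pooled or fed into a task-specific head.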
You can also carry out more involved tasks such as training and fine-tuning, as shown in this tutorial.
Relevant Resources
As the Hugging Face transformers library and tooling expand to more languages, the community is building tools on top of them. For Arabic, Tnkeeh (تنقيح) is a text cleaning library that can be used for diacritic removal, normalization, segmentation and many other preprocessing tasks; it can operate directly on instances of the datasets library. CAMeL Tools has also incorporated transformers into its API, which can be used for many tasks such as sentiment classification.