Currently there are a fair number of encoder-only and decoder-only models for Arabic (AraBERT, AraElectra, AraGPT2, etc.), but there are no seq2seq models. The goal of this project is to pretrain a T5 language model for Arabic.
Model
A randomly initialized T5 model. I'm not yet sure what the largest model size is that can be trained within a one-week period, however. Ideally, we should train the largest variant that fits, since bigger models tend to converge faster.
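As a rough sketch (not a commitment to a particular size), this is how a randomly initialized T5 could be instantiated with the Flax classes in `transformers`; the `google/t5-v1_1-base` config and the 32k vocabulary size are placeholders, assuming we also train a new Arabic tokenizer:

```python
# Minimal sketch: build a T5 model with randomly initialized weights
# (no pretrained checkpoint). The "base" config and the 32k vocab size
# are placeholder assumptions, not final choices.
from transformers import T5Config, FlaxT5ForConditionalGeneration

config = T5Config.from_pretrained("google/t5-v1_1-base", vocab_size=32_000)
model = FlaxT5ForConditionalGeneration(config, seed=0)  # weights drawn fresh, not loaded
```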
The Arabic subset of mC4 should be more than enough, though. It contains 57 billion tokens (according to the mT5 paper), while the original English T5 model was trained on 2^35 (roughly 34 billion) tokens (T5 paper).
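A minimal sketch of how one might stream that subset with the `datasets` library (assuming the `mc4` dataset on the Hub with an `ar` config; streaming avoids downloading the full corpus up front):

```python
# Sketch: stream the Arabic split of mC4 instead of downloading tens of
# billions of tokens to disk. Swapping "ar" for "fa" would give the
# Persian subset mentioned below.
from datasets import load_dataset

mc4_ar = load_dataset("mc4", "ar", split="train", streaming=True)
for example in mc4_ar.take(2):
    print(example["text"][:200])  # records carry "text", "url" and "timestamp" fields
```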
I'm not sure this project will be very useful for Persian. It would be much better to pretrain a new model from scratch for Persian directly. The mC4 dataset also has a Persian subset, which you can use for pretraining.
Yes, looking at page 12 of mT5's original paper, Arabic is indeed included in the pre-training corpus (53 million pages, 57 billion tokens, i.e. 1.66% of the total pre-training data).
Hello Zakarya! The trained model was published on the Hub right after the event concluded, two years ago.
You can find it here. I also fine-tuned the model for question paraphrasing.
The results are not bad, but not great either; that's mainly because we couldn't get training to work until the last day or so of the event.
Let me know if you're interested in re-training the model, or if you have any other ideas.