Currently there are a fair number of encoder-only and decoder-only models for Arabic (AraBERT, AraElectra, AraGPT2, etc.), but there aren’t any seq2seq models. The goal of this project is to pretrain a T5 language model for the Arabic language.
Model
A randomly initialized T5 model. I’m not sure yet what the biggest model size is that can be trained within a one-week period, though. Ideally, I think we should train the biggest variant possible, given that bigger models converge faster.
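For concreteness, here is a minimal sketch of how such a randomly initialized T5 could be created with the Hugging Face transformers library. The choice of the `t5-base` configuration is only an assumption for illustration; the actual size would depend on the compute budget.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Reuse an existing T5 configuration purely as a size template
# (assumption: "t5-base"; a larger variant could be substituted if compute allows).
config = T5Config.from_pretrained("t5-base")
# Note: vocab_size in the config would need to match whatever Arabic tokenizer is used.

# Constructing the model from a config alone gives randomly initialized weights,
# i.e. no pretrained checkpoint is loaded.
model = T5ForConditionalGeneration(config)

print(f"{model.num_parameters():,} parameters")  # roughly 220M for the base configuration
```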
The Arabic subset of mC4 should be more than enough, though. It contains 57 billion tokens (according to the mT5 paper), while the English T5 model was trained on 2^35 (about 34 billion) tokens (T5 paper).
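As a rough sketch of accessing that data (assuming the `mc4` dataset on the Hugging Face Hub via the `datasets` library; the exact dataset identifier may have changed since), the Arabic subset can be streamed rather than downloaded in full:

```python
from datasets import load_dataset

# Stream the Arabic split so the ~57B-token subset is not downloaded up front.
mc4_ar = load_dataset("mc4", "ar", split="train", streaming=True)

# Peek at one document to sanity-check the text field.
sample = next(iter(mc4_ar))
print(sample["text"][:200])

# For comparison, the T5 pretraining budget mentioned above:
print(2**35)  # 34,359,738,368 ≈ 34 billion tokens
```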
I’m not sure this project will be very useful for Persian. It’s much better to pretrain a new model from scratch for Persian directly. The mC4 dataset also has a Persian subset that you can use for pretraining.
Yes, looking at page 12 of mT5’s original paper, Arabic is indeed included in the pre-training corpus (57 billion tokens, 53 million pages, or 1.66% of the total pre-training data).