Pretrain T5 for Arabic

T5 for Arabic

Currently there are a fair number of encoder-only and decoder-only models for Arabic (AraBERT, AraElectra, AraGPT2, etc.), but there aren’t any seq2seq models. The goal of this project is to pretrain a T5 language model for the Arabic language.

Model

A randomly initialized T5 model. I’m not sure yet, however, what the biggest model size is that can be trained within a one-week period. Ideally, we should train the biggest variant possible, given that bigger models converge faster.
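As a starting point, here is a minimal sketch (assuming the Hugging Face transformers Flax classes) of how a randomly initialized model of a given size could be created; the actual size we can afford in one week would still need to be measured:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the architecture of an existing size ("t5-base" is just a placeholder here)
# but keep the weights randomly initialized and match the vocab size of whatever
# Arabic tokenizer we end up training (32k is an assumption).
config = T5Config.from_pretrained("t5-base", vocab_size=32_000)
model = FlaxT5ForConditionalGeneration(config, seed=0)

print(config.num_layers, config.d_model, config.num_heads)
```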

Datasets

The Arabic subset of mC4 should be more than enough. According to the mT5 paper it contains 57 billion tokens, while the English T5 model was trained on only 2^35 (about 34 billion) tokens (T5 paper).
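Since the corpus is so large, we could stream it instead of downloading it all up front. A rough sketch with the datasets library (the "mc4" dataset name and the "ar" config are assumptions based on how the multilingual C4 subsets are exposed):

```python
from datasets import load_dataset

# Stream the Arabic split so the full corpus never has to fit on disk at once.
mc4_ar = load_dataset("mc4", "ar", split="train", streaming=True)

# Peek at one document without materializing the whole dataset.
example = next(iter(mc4_ar))
print(example["text"][:200])
```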

Training scripts

We can use the recently released T5 pretraining script.
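For reference, here is a toy illustration (not the real script’s implementation) of the span-corruption objective the pretraining script optimizes: contiguous spans of the input are replaced by sentinel tokens, and the target reconstructs the dropped spans in order.

```python
def corrupt_spans(tokens, spans):
    """Drop the given (start, end) spans from `tokens`, T5-style."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[cursor:start] + [sentinel]   # span replaced by a sentinel
        targets += [sentinel] + tokens[start:end]     # target spells out the span
        cursor = end
    inputs += tokens[cursor:]
    targets.append(f"<extra_id_{len(spans)}>")        # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week".split()
print(corrupt_spans(tokens, [(2, 5), (7, 8)]))
# ('Thank you <extra_id_0> to your <extra_id_1> last week',
#  '<extra_id_0> for inviting me <extra_id_1> party <extra_id_2>')
```

In the real script the spans are sampled randomly (around 15% of the tokens, with a mean span length of 3), but the input/target format is the same.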

Challenges

I have never worked with datasets this huge.

Desired outcomes

  • An Arabic T5 model with reasonable results when fine-tuned for summarization, question generation, question answering and so on.
  • Learning some JAX and having fun :grinning_face_with_smiling_eyes:

Team members:

  • ManarAli
  • Onyx
  • salti

I am very interested in this project.
Are you still looking for partners?

Yes! I’m actually the only member so far, so I could definitely use some help.


I’m also interested in joining.

Great! I’ll update the post with the current team members.

Let’s create it :slight_smile:

I’m so interested in joining.
I just want to use this approach for the Persian/Farsi language.
Are you still looking for partners?

I’m not sure this project will be very useful for Persian. It would be much better to pretrain a new model from scratch for Persian directly. The mC4 dataset also has a Persian subset, which you can use for pretraining.