Pretrain T5 for Arabic

T5 for Arabic

There are currently a fair number of encoder-only and decoder-only models for Arabic (AraBERT, AraElectra, AraGPT2, etc.), but no seq2seq models. The goal of this project is to pretrain a T5 language model for Arabic.

Model

A randomly initialized T5 model. I’m not sure yet what the biggest model size is that can be trained within a one-week period. Ideally, I think we should train the biggest variant possible, given that bigger models converge faster.
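For reference, here is a minimal sketch of how a randomly initialized T5 could be built with the transformers Flax classes; the "t5-base" config name and the vocabulary size are only placeholders for whatever we end up choosing:

```python
# Minimal sketch: build a randomly initialized Flax T5 from a config.
# "t5-base" and vocab_size=32_000 are placeholders, not final choices.
from transformers import T5Config, FlaxT5ForConditionalGeneration

config = T5Config.from_pretrained("t5-base", vocab_size=32_000)
model = FlaxT5ForConditionalGeneration(config, seed=0)  # random weights, no pretrained checkpoint
```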

Datasets

The Arabic subset of mC4 should be more than enough. It contains 57 billion tokens (according to the mT5 paper), while the English T5 model was trained on 2^{35} (about 34 billion) tokens (T5 paper).
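A rough sketch of loading the data, assuming the "mc4" dataset id on the Hub with the "ar" config; streaming mode avoids having to download the whole corpus to disk:

```python
# Rough sketch: stream the Arabic subset of mC4 instead of downloading all of it.
from itertools import islice
from datasets import load_dataset

mc4_ar = load_dataset("mc4", "ar", split="train", streaming=True)
for sample in islice(mc4_ar, 3):   # iterable dataset, nothing is fully downloaded
    print(sample["text"][:100])
```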

Training scripts

We can use the recently released T5 pretraining script (run_t5_mlm_flax.py from the transformers Flax examples).

Challenges

I have never worked with datasets this huge.

Desired outcomes

  • An Arabic T5 model with reasonable results when fine-tuned for summarization, question generation, question answering and so on.
  • Learning some JAX and having fun :grinning_face_with_smiling_eyes:

Team members:

  • ManarAli
  • Onyx
  • salti

I am very interested in this project.
Are you still looking for partners?

Yes! I was actually the only member so far, so I definitely could use some help.


I’m also interested in joining.

Great! I’ll update the post with the current team members.

Let’s create it :slight_smile:

I’m very interested in joining.
I just want to use this approach for the Persian/Farsi language.
Are you still looking for partners?

I’m not sure this project will be very useful for Persian. It’s much better to pretrain a new model from scratch for Persian directly. The mC4 dataset also has a Persian subset which you can use for pretraining.

Does mT5 support Arabic?

Yes, looking at page 12 of mT5’s original paper, Arabic is indeed included in the pre-training corpus (57 billion tokens, 53 million pages, or 1.66% of the total pre-training data).

OK. Does that mean I can fine-tune mT5 on my Arabic dataset for task-oriented dialogue system tasks?

Yes. You can fine-tune T5 models on virtually any task if you formulate it as a text-to-text task.
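As a rough sketch of what that looks like in practice: you write both the input and the expected output as plain strings and train the model to map one to the other. The dialogue example pair below is made up, only the "google/mt5-small" checkpoint name is real:

```python
# Rough sketch: any task becomes text-to-text by expressing input and output as strings.
# The dialogue example pair below is hypothetical.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

source = "dialogue: I want to book a table for two people tonight."  # task input as text
target = "book_restaurant(people=2, time=tonight)"                   # expected output as text

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # the loss you would minimize during fine-tuning
```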

OK, thank you.

Is there anything I can help you guys with?

Is the team complete, or do you still need members?

Any updates on this? This is the best time to make it work. Let’s revive this if it’s not yet finished.

Hello Zakarya, the trained model was published on the Hub right after the event concluded two years ago.
You can find it here; I also fine-tuned the model for question paraphrasing.
The results are not bad but not that great either, mostly because we couldn’t get the training to work until the last day or so of the event.
Let me know if you’re interested in re-training the model or have any other ideas.

Hi Salti,

That’s really good. I would love to play with the model and further train or fine-tune it.

If you have LinkedIn, we can connect there. Here is the link to my profile: https://www.linkedin.com/in/zakarya-alsalahi

Regards,
Zak