Currently there are a fair number of encoder-only and decoder-only models for Arabic (AraBERT, AraElectra, AraGPT2, etc.), but there aren’t any seq2seq models. The goal of this project is to pretrain a T5 language model for the Arabic language.
Model
A randomly initialized T5 model. I’m not sure yet what the biggest model size is that can be trained within a one-week period, though. Ideally, I think we should train the biggest variant possible, given that bigger models converge faster.
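For concreteness, here is a minimal sketch of how such a randomly initialized T5 could be created with the Hugging Face transformers library. The choice of the `t5-base` configuration is only an assumption for illustration; the actual size would depend on the compute budget.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Reuse an existing T5 configuration purely as a size template
# (assumption: "t5-base"; a larger variant could be substituted if compute allows).
config = T5Config.from_pretrained("t5-base")
# Note: vocab_size in the config would need to match whatever Arabic tokenizer is used.

# Constructing the model from a config alone gives randomly initialized weights,
# i.e. no pretrained checkpoint is loaded.
model = T5ForConditionalGeneration(config)

print(f"{model.num_parameters():,} parameters")  # roughly 220M for the base configuration
```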
The Arabic subset of mC4 should be more than enough, though. It contains 57 billion tokens (according to the mT5 paper), while the English T5 model was trained on 2^35 (about 34 billion) tokens (T5 paper).
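As a rough sketch of accessing that data (assuming the `mc4` dataset on the Hugging Face Hub via the `datasets` library; the exact dataset identifier may have changed since), the Arabic subset can be streamed rather than downloaded in full:

```python
from datasets import load_dataset

# Stream the Arabic split so the ~57B-token subset is not downloaded up front.
mc4_ar = load_dataset("mc4", "ar", split="train", streaming=True)

# Peek at one document to sanity-check the text field.
sample = next(iter(mc4_ar))
print(sample["text"][:200])

# For comparison, the T5 pretraining budget mentioned above:
print(2**35)  # 34,359,738,368 ≈ 34 billion tokens
```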
I’m not sure this project will be very useful for Persian. It’s much better to pretrain a new model from scratch for Persian directly. The mC4 dataset also has a Persian subset that you can use for pretraining.
Yes, looking at page 12 of mT5’s original paper, Arabic is indeed included in the pre-training corpus (57 billion tokens, 53 million pages, or 1.66% of the total pre-training data).