Currently the Italian Hub has only autoencoding (e.g. BERTino) and autoregressive (e.g. GePpeTto) language models. The goal of this project is to create a strong Text2TextGeneration model for seq2seq tasks in Italian.
We’ll be using a randomly initialized T5 model with a configuration similar to that of t5-base, since the Flax implementation of T5 was recently made available on the Hub.
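As a minimal sketch of this step (assuming the current transformers Flax API), we can borrow the t5-base hyperparameters while keeping the weights randomly initialized:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the t5-base hyperparameters (d_model=768, 12 layers, ...)
config = T5Config.from_pretrained("t5-base")

# Instantiating from the config alone initializes the weights from scratch;
# no pretrained checkpoint is loaded
model = FlaxT5ForConditionalGeneration(config, seed=42)
```

Note that a SentencePiece tokenizer would still need to be trained on Italian text, since the t5-base vocabulary is English-centric.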
The multilingual C4 (mC4) dataset was recently made available, and it surpasses the OSCAR corpus in size by a large margin (590GB cleaned vs. 69GB). Given the new streaming capabilities for large datasets, this should be easily feasible. Another option is to mix multiple large datasets (mC4, OSCAR, Italian Wikipedia) with interleave_datasets, as sketched below.
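A rough sketch of the streaming + mixing approach (the dataset ids are the usual Hub names; the mixing probabilities are illustrative assumptions, not tuned choices):

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be fully downloaded to disk
mc4_it = load_dataset("mc4", "it", split="train", streaming=True)
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)

# Keep only the shared "text" column so the features of the two streams match
mc4_it = mc4_it.remove_columns(["timestamp", "url"])
oscar_it = oscar_it.remove_columns(["id"])

# Mix examples from both corpora according to the given (illustrative) probabilities
mixed = interleave_datasets([mc4_it, oscar_it], probabilities=[0.85, 0.15], seed=42)

# Lazily inspect a few mixed examples
for example in mixed.take(3):
    print(example["text"][:100])
```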
We can leverage the T5 pretraining script that is currently being prepared to train the model, and run_summarization_flax.py to test downstream performance, e.g. on SQuAD-it converted to text2text format (see the sketch below).
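The conversion itself could look roughly like this (assuming the squad_it dataset on the Hub, which follows the standard SQuAD schema):

```python
from datasets import load_dataset

squad_it = load_dataset("squad_it", split="train")

def to_text2text(example):
    # T5-style prefixed input; the first gold answer becomes the target
    return {
        "input_text": f"question: {example['question']} context: {example['context']}",
        "target_text": example["answers"]["text"][0],
    }

squad_it = squad_it.map(to_text2text, remove_columns=squad_it.column_names)
```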
The project involves pre-training a large LM from scratch, so it will inevitably take some time to reach competitive performance. Most scripts and implementations are very new, so it is likely that we’ll run into some issues that need fixing.
The main project outcome, besides producing a trained T5 model for Italian, is to have fun and get our hands dirty with some JAX/Flax! We could optionally test the validity of our results at a later stage by fine-tuning the model on some downstream tasks.