PreTrain T5 for Italian 🇮🇹

Train T5 from scratch in Italian on mC4

Currently, the Hub only hosts autoencoding (e.g. bert-base-italian, BERTino) and autoregressive (e.g. GePpeTto) language models for Italian. The goal of this project is to create a strong Text2TextGeneration model for Italian seq2seq tasks.

Model

We’ll use a randomly initialized T5 model with a configuration similar to that of t5-base, since the Flax implementation of T5 was recently made available on the Hub.
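
As a rough sketch, instantiating such a model from scratch could look like the snippet below. The vocabulary size and seed are placeholder assumptions, and would depend on the Italian tokenizer we end up training.

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the t5-base architecture; vocab_size is an assumption and should
# match the Italian tokenizer trained for this project.
config = T5Config.from_pretrained("t5-base", vocab_size=32_000)

# Randomly initialized weights, no pretrained checkpoint is loaded.
model = FlaxT5ForConditionalGeneration(config, seed=42)
```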

Datasets

The multilingual C4 (mC4) dataset was recently made available, and it surpasses the OSCAR corpus by a large margin in size (590 GB cleaned vs. 69 GB). Given the new streaming capabilities of the datasets library for large datasets, training on it should be feasible. Another option is to mix multiple large datasets (mC4, OSCAR, Italian Wikipedia) with interleave_datasets, as sketched below.
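
A minimal sketch of streaming and mixing the corpora with the datasets library; the 0.7/0.3 mixing proportions are arbitrary placeholders, not a tested choice.

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be fully downloaded to disk.
mc4_it = load_dataset("mc4", "it", split="train", streaming=True)
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)

# Mix the two streams; sampling probabilities here are assumptions.
train_stream = interleave_datasets([mc4_it, oscar_it], probabilities=[0.7, 0.3], seed=42)
```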

Training scripts

We can leverage the T5 pretraining script that is currently being prepared to train the model, and run_summarization_flax.py to test downstream performance, e.g. on SQuAD-it converted to text2text format.
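
For the downstream check, converting SQuAD-it to a text2text format could look roughly like this. The "domanda"/"contesto" task prefixes and the source/target column names are illustrative assumptions, not a fixed convention.

```python
from datasets import load_dataset

squad_it = load_dataset("squad_it", split="train")

def to_text2text(example):
    # Hypothetical formatting: question + context as the source sequence,
    # first gold answer as the target sequence.
    return {
        "source": f"domanda: {example['question']} contesto: {example['context']}",
        "target": example["answers"]["text"][0],
    }

squad_it_t2t = squad_it.map(to_text2text, remove_columns=squad_it.column_names)
```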

Challenges

The project involves pre-training a large LM, so it will inevitably take some time to reach competitive performance. Most scripts and implementations are very new, so it is likely that we’ll run into some issues that need fixing along the way.

Desired project outcome

The main project outcome, besides producing a trained T5 model for Italian, is to have fun and get our hands dirty with some JAX! We could optionally test the validity of our results at a later stage by fine-tuning the model on some downstream tasks.

Reads


Think this is a cool project - let’s see if we get some more participants 🙂


Any chance of getting some compute for this one? 🙂 Everything should work out of the box with the scripts, so I think I can handle it solo if TPUs are available! @patrickvonplaten


Let’s add it!
