Currently the Italian Hub has only autoencoding (e.g. BERTino) and autoregressive (e.g. GePpeTto) language models. The goal of this project is to create a strong Text2TextGeneration model for seq2seq tasks in Italian.
We’ll be using a randomly initialized T5 model with a configuration similar to that of t5-base, since the Flax implementation of T5 was recently made available on the Hub.
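As a minimal sketch of this step (assuming the current transformers Flax API), we can borrow the t5-base hyperparameters while keeping the weights randomly initialized:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the t5-base hyperparameters (d_model=768, 12 layers, ...)
config = T5Config.from_pretrained("t5-base")

# Instantiating from the config alone initializes the weights from scratch;
# no pretrained checkpoint is loaded
model = FlaxT5ForConditionalGeneration(config, seed=42)
```

Note that a SentencePiece tokenizer would still need to be trained on Italian text, since the t5-base vocabulary is English-centric.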
The multilingual C4 (mC4) dataset was recently made available, and it surpasses the OSCAR corpus in size by a large margin (590GB cleaned vs. 69GB). Given the new streaming capabilities for large datasets, this should be easily feasible. Another option is to mix multiple large datasets (mC4, OSCAR, Italian Wikipedia) with interleave_datasets, as sketched below.
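A rough sketch of the streaming + mixing approach (the dataset ids are the usual Hub names; the mixing probabilities are illustrative assumptions, not tuned choices):

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be fully downloaded to disk
mc4_it = load_dataset("mc4", "it", split="train", streaming=True)
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)

# Keep only the shared "text" column so the features of the two streams match
mc4_it = mc4_it.remove_columns(["timestamp", "url"])
oscar_it = oscar_it.remove_columns(["id"])

# Mix examples from both corpora according to the given (illustrative) probabilities
mixed = interleave_datasets([mc4_it, oscar_it], probabilities=[0.85, 0.15], seed=42)

# Lazily inspect a few mixed examples
for example in mixed.take(3):
    print(example["text"][:100])
```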
We can leverage the T5 pretraining script that is currently being prepared to train the model, and run_summarization_flax.py to test downstream performance, e.g. on SQuAD-it converted to text2text format (see the sketch below).
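The conversion itself could look roughly like this (assuming the squad_it dataset on the Hub, which follows the standard SQuAD schema):

```python
from datasets import load_dataset

squad_it = load_dataset("squad_it", split="train")

def to_text2text(example):
    # T5-style prefixed input; the first gold answer becomes the target
    return {
        "input_text": f"question: {example['question']} context: {example['context']}",
        "target_text": example["answers"]["text"][0],
    }

squad_it = squad_it.map(to_text2text, remove_columns=squad_it.column_names)
```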
The project involves pre-training a large LM from scratch, so it will inevitably take some time to reach competitive performance. Most scripts and implementations are very new, so it is likely that we’ll run into some issues that need fixing.
The main project outcome, besides producing a trained T5 model for Italian, is to have fun and get our hands dirty with some JAX/Flax! We could optionally test the validity of our results at a later stage by fine-tuning the model on some downstream tasks.