Currently the Italian Hub has only autoencoding (e.g. BERTino) and autoregressive (e.g. GePpeTto) language models. The goal of this project is to create a strong Text2TextGeneration model for seq2seq tasks in Italian.
We’ll be using a randomly initialized T5 model with a configuration similar to that of t5-base, since the Flax implementation of T5 was recently made available on the Hub.
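As a minimal sketch of this step (assuming the current transformers Flax API), we can borrow the t5-base hyperparameters while keeping the weights randomly initialized:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the t5-base hyperparameters (d_model=768, 12 layers, ...)
config = T5Config.from_pretrained("t5-base")

# Instantiating from the config alone initializes the weights from scratch;
# no pretrained checkpoint is loaded
model = FlaxT5ForConditionalGeneration(config, seed=42)
```

Note that a SentencePiece tokenizer would still need to be trained on Italian text, since the t5-base vocabulary is English-centric.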
The multilingual C4 (mC4) dataset was recently made available, and it surpasses the OSCAR corpus in size by a large margin (590GB cleaned vs. 69GB). Given the new streaming capabilities for large datasets, this should be easily feasible. Another option is to mix multiple large datasets (mC4, OSCAR, Italian Wikipedia) with interleave_datasets, as sketched below.
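A rough sketch of the streaming + mixing approach (the dataset ids are the usual Hub names; the mixing probabilities are illustrative assumptions, not tuned choices):

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be fully downloaded to disk
mc4_it = load_dataset("mc4", "it", split="train", streaming=True)
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)

# Keep only the shared "text" column so the features of the two streams match
mc4_it = mc4_it.remove_columns(["timestamp", "url"])
oscar_it = oscar_it.remove_columns(["id"])

# Mix examples from both corpora according to the given (illustrative) probabilities
mixed = interleave_datasets([mc4_it, oscar_it], probabilities=[0.85, 0.15], seed=42)

# Lazily inspect a few mixed examples
for example in mixed.take(3):
    print(example["text"][:100])
```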
We can leverage the T5 pretraining script that is currently being prepared to train the model, and run_summarization_flax.py to test downstream performance, e.g. on SQuAD-it converted to text2text format (see the sketch below).
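The conversion itself could look roughly like this (assuming the squad_it dataset on the Hub, which follows the standard SQuAD schema):

```python
from datasets import load_dataset

squad_it = load_dataset("squad_it", split="train")

def to_text2text(example):
    # T5-style prefixed input; the first gold answer becomes the target
    return {
        "input_text": f"question: {example['question']} context: {example['context']}",
        "target_text": example["answers"]["text"][0],
    }

squad_it = squad_it.map(to_text2text, remove_columns=squad_it.column_names)
```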
The project involves pre-training a large LM from scratch, so it will inevitably take some time to reach competitive performance. Most scripts and implementations are very new, so it is likely that we’ll run into some issues that need fixing.
The main project outcome, besides producing a trained T5 model for Italian, is to have fun and get our hands dirty with some JAX/Flax! We could optionally test the validity of our results at a later stage by fine-tuning the model on some downstream tasks.