Pretrain T5 from scratch in Dutch

There are currently no Dutch seq2seq models on the Hugging Face Hub. Multilingual seq2seq models such as mBART and mT5 exist, but there are no Dutch-only (nl) models. This project aims to pre-train T5 from scratch on the Dutch segment of mC4.

Language

Dutch

Model

Randomly initialized T5 model
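As a loose illustration (not the project's exact setup), instantiating a Flax T5 model from a config alone gives randomly initialized weights; the `t5-base` config name below is used only to borrow architecture hyperparameters, not pre-trained weights:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Borrow the t5-base architecture hyperparameters. Building the Flax model
# from a config (rather than from_pretrained) initializes all parameters
# randomly, which is what pre-training from scratch requires.
config = T5Config.from_pretrained("t5-base", vocab_size=32128)
model = FlaxT5ForConditionalGeneration(config, seed=0)
```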

Datasets

Our primary dataset for pre-training is the Dutch part of multilingual C4 (mC4), together with separately downloaded and curated documents from Dutch news sites in Common Crawl. We re-use these datasets, which have already been cleaned for the Flax/JAX community week project that trains BigBird from scratch on Dutch. The cleaning uses an adapted version of the cleaning script used for the English C4, which filters bad words and content that does not consist of proper sentences. Details and the source of the data cleaning will be provided on the model hub page for this project.
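A minimal sketch of loading and filtering the Dutch mC4 split, assuming a recent `datasets` release with streaming filter support; the `mc4` dataset name and the `BADWORDS` list are placeholders standing in for the adapted C4 cleaning script:

```python
from datasets import load_dataset

# Stream the Dutch split of mC4 to avoid downloading the full dataset up front.
dataset = load_dataset("mc4", "nl", split="train", streaming=True)

# Hypothetical filters standing in for the adapted English C4 cleaning script:
# drop documents containing bad words or lacking sentence-like lines.
BADWORDS = {"voorbeeldscheldwoord"}  # placeholder list

def is_clean(example):
    text = example["text"].lower()
    if any(word in text for word in BADWORDS):
        return False
    # Require at least one line that ends like a real sentence.
    return any(line.rstrip().endswith((".", "!", "?")) for line in text.splitlines())

cleaned = dataset.filter(is_clean)
```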
The datasets for fine-tuning will be the CNN and XSum summarization datasets translated to Dutch, plus news summaries from the Dutch nu.nl website downloaded from Common Crawl.

Training script

The starting point for the training will be the `run_t5_mlm_flax.py` example script in the huggingface/transformers repository on GitHub (master branch).
This will result in a model that needs to be fine-tuned on a downstream task.
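For context, `run_t5_mlm_flax.py` pre-trains with T5's span-corruption objective. The Dutch strings below only illustrate the input/target format of that objective; they are not output from the script:

```python
# T5 span corruption (conceptual): contiguous spans in the input are each
# replaced by a sentinel token, and the target reconstructs the dropped
# spans, delimited by the same sentinels.
original = "het weer in nederland is vandaag erg mooi"
corrupted_input = "het weer in <extra_id_0> vandaag erg <extra_id_1>"
target = "<extra_id_0> nederland is <extra_id_1> mooi <extra_id_2>"
```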
For the demo we will fine-tune the model to perform summarization of news articles in Dutch using the `run_summarization_flax.py` example script in huggingface/transformers (pinned at commit 7d6285a921a23c06169e2d90c94faa0d92d00d78).
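Once fine-tuned, the model could be used roughly as follows. This is a sketch: the checkpoint path is hypothetical, and the generation settings are illustrative defaults, not tuned values:

```python
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

# Hypothetical local path to the fine-tuned checkpoint.
model_dir = "path/to/t5-dutch-news-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = FlaxT5ForConditionalGeneration.from_pretrained(model_dir)

article = "..."  # a Dutch news article goes here
inputs = tokenizer(
    "summarize: " + article,
    return_tensors="np",
    max_length=512,
    truncation=True,
)
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    num_beams=4,
    early_stopping=True,
).sequences
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```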

Challenges

Adapting the training scripts to our use case and custom data files.

Desired project outcome

  • A T5 model that can be fine-tuned for Dutch seq2seq tasks
  • A T5 model that can summarize news articles in the Dutch language
  • A Streamlit app that demonstrates news article summarization (see the sketch below)
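A minimal sketch of such a demo, assuming a recent Streamlit release; the checkpoint path is the same hypothetical one as above, and the Flax weights are loaded into PyTorch so the pipeline API can serve them:

```python
import streamlit as st
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

MODEL_DIR = "path/to/t5-dutch-news-summarization"  # hypothetical checkpoint

@st.cache_resource
def load_summarizer():
    # Convert the Flax checkpoint to PyTorch on load so the pipeline can use it.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR, from_flax=True)
    return pipeline("summarization", model=model, tokenizer=tokenizer)

st.title("Dutch news article summarization")
article = st.text_area("Paste a Dutch news article:", height=300)

if st.button("Summarize") and article.strip():
    summarizer = load_summarizer()
    summary = summarizer(article, max_length=128, truncation=True)[0]["summary_text"]
    st.subheader("Summary")
    st.write(summary)
```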

Same team!

Finally 🙂 Great! Would love to help here.