Pretrain T5 from scratch in Dutch

There are currently no Dutch seq2seq models on the Hugging Face Hub. Multilingual seq2seq models such as mBART and mT5 exist, but no Dutch-only (nl) models. This project aims to pre-train T5 from scratch on the Dutch segment of mC4.




Randomly initialized T5 model


Our primary dataset for pre-training is the Dutch part of multilingual C4 (mC4), complemented by separately downloaded and curated Dutch news-site documents from Common Crawl. We re-use datasets that were already cleaned for the Flax/JAX-week project that trains BigBird from scratch on Dutch. The cleaning uses an adapted version of the script used for the English C4, which filters out bad words and content that does not form sentences. Details and the source of the data cleaning will be provided on the model hub page for this project.
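The kind of filtering involved can be sketched as below; this is an illustrative C4-style filter, not the actual cleaning script, and the bad-word list and sentence heuristic are placeholders:

```python
# Illustrative sketch of C4-style document cleaning: drop whole documents
# containing bad words, and keep only lines that look like sentences.
# BAD_WORDS is a placeholder for the real Dutch bad-word list.
BAD_WORDS = {"badword1", "badword2"}


def looks_like_sentence(line):
    """Heuristic: at least three words and terminal punctuation (C4-style)."""
    line = line.strip()
    return len(line.split()) >= 3 and line.endswith((".", "!", "?", '"'))


def clean_document(text):
    """Return the cleaned document, or None if it should be dropped."""
    kept = []
    for line in text.splitlines():
        words = {w.strip(".,!?\"'").lower() for w in line.split()}
        if words & BAD_WORDS:
            return None  # C4 drops the entire document on a bad word
        if looks_like_sentence(line):
            kept.append(line.strip())
    return "\n".join(kept) if kept else None
```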
The datasets for fine-tuning will be CNN/DailyMail and XSum translated to Dutch, plus news summaries from Dutch news sites downloaded from Common Crawl.
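The translated article/summary pairs can be stored as JSON Lines so the fine-tuning script can read them as a custom dataset; the field names `text` and `summary` below are an assumption and should match whatever column names the script is configured with:

```python
# Sketch: write translated (article, summary) pairs as JSON Lines for use
# as a custom fine-tuning dataset. The field names "text" and "summary"
# are assumptions; align them with the training script's column settings.
import json


def write_jsonl(pairs, path):
    """pairs: iterable of (article_text, summary_text) tuples."""
    with open(path, "w", encoding="utf-8") as f:
        for article, summary in pairs:
            row = {"text": article, "summary": summary}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


pairs = [("Een lang nieuwsartikel over het weer ...", "Korte samenvatting.")]
write_jsonl(pairs, "train_nl.jsonl")
```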

Training script

The starting point for training will be the huggingface/transformers repository on GitHub (master branch).
This will result in a model that needs to be fine-tuned on a downstream task.
For the demo we will fine-tune the model to perform summarization of Dutch news articles, using huggingface/transformers at commit 7d6285a921a23c06169e2d90c94faa0d92d00d78 on GitHub.


We will adapt the training script to our use case and custom data files.
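Adapting the script mostly means pointing it at the custom data files. A hedged sketch of the invocation: the script name and flag names follow the Flax summarization example in transformers and should be verified against the pinned commit, and the model name is hypothetical:

```python
# Sketch of the fine-tuning invocation with custom data files. Script and
# flag names follow the Flax summarization example in transformers; verify
# them against the pinned commit before running (assumption). The model
# name "t5-base-dutch" is hypothetical.
train_cmd = [
    "python", "run_summarization_flax.py",
    "--model_name_or_path", "t5-base-dutch",   # hypothetical hub name
    "--train_file", "train_nl.jsonl",          # custom translated data
    "--validation_file", "val_nl.jsonl",
    "--text_column", "text",
    "--summary_column", "summary",
    "--output_dir", "./t5-dutch-summarization",
    "--do_train",
    "--do_eval",
]
# import subprocess; subprocess.run(train_cmd, check=True)  # launch training
```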

Desired project outcome

  • A T5 model that can be fine-tuned for Dutch seq2seq tasks
  • A T5 model that can summarize news articles in the Dutch language
  • A Streamlit app that demonstrates news article summarization
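The demo app can be a few lines of Streamlit; a minimal sketch, assuming the fine-tuned model is published on the hub under a hypothetical name:

```python
# Minimal Streamlit demo sketch for Dutch news summarization. The model id
# "t5-base-dutch-summarization" is hypothetical; substitute the actual hub
# id once the fine-tuned model is published.


def truncate_words(text, max_words=512):
    """Keep the input within a rough word budget before summarizing."""
    return " ".join(text.split()[:max_words])


def main():
    # Imports are deferred so the helper above stays usable without
    # Streamlit or transformers installed.
    import streamlit as st
    from transformers import pipeline

    st.title("Dutch news summarization")
    article = st.text_area("Paste a Dutch news article")
    if st.button("Summarize") and article:
        summarizer = pipeline(
            "summarization", model="t5-base-dutch-summarization"
        )
        st.write(summarizer(truncate_words(article))[0]["summary_text"])


if __name__ == "__main__":
    main()
```

The app would be launched with `streamlit run app.py`.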

Same team!

Finally 🙂 Great! Would love to help here