Fine-Tuning a Seq2Seq model for sentence splitting in English.
Sentence splitting is the task of dividing a long sentence into multiple shorter sentences. For example, the sentence:
Mary likes to play football in her free time whenever she meets with her friends that are very nice people.
could be split into
Mary likes to play football in her free time whenever she meets with her friends. Her friends are very nice people.
Currently there is only one model on the hub for sentence splitting: google/roberta2roberta_L-24_wikisplit · Hugging Face
The goal of this project is to have the best sentence splitting model for English on the hub.
Model
One can use one or more of the pretrained T5 models.
Datasets
The Wikisplit dataset can be used: WikiSplit – Google Research
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training.
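As an illustration, a training run might be launched roughly like the sketch below. All flag values, file names, and column names here are placeholders and assumptions; check the script's `--help` for the actual interface and defaults.

```shell
# Hedged sketch of launching the Flax summarization script on
# WikiSplit-style data. Paths, column names, and hyperparameters
# are illustrative placeholders, not a verified recipe.
python run_summarization_flax.py \
    --model_name_or_path t5-base \
    --train_file wikisplit_train.csv \
    --validation_file wikisplit_val.csv \
    --text_column complex_sentence \
    --summary_column simple_sentences \
    --source_prefix "split: " \
    --output_dir ./t5-wikisplit \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3
```

Note the `--source_prefix "split: "` idea: T5 checkpoints were pretrained with task prefixes, so giving the splitting task its own prefix is a common (assumed, not required) convention.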
(Optional) Desired project outcome
The desired outcome is to have a sentence splitting model for the English language. This can be showcased directly on the hub or with a streamlit or gradio app.
(Optional) Challenges
Preprocessing the dataset can be challenging, but should be feasible. It might even be a nice side project to add the WikiSplit dataset (WikiSplit – Google Research) to the datasets library while training the model.
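To make the preprocessing step concrete, a minimal sketch is shown below. It assumes the raw WikiSplit release is tab-separated, with the complex sentence in one column and the simple sentences in the other, joined by a `<::::>` marker; verify this against the actual data files before relying on it.

```python
# Minimal preprocessing sketch for a T5-style split model.
# Assumption: each raw WikiSplit line is
#   "<complex sentence>\t<simple 1> <::::> <simple 2>"
# (check the actual release format).

def make_example(line, task_prefix="split: "):
    """Turn one raw WikiSplit line into a (source, target) pair."""
    complex_sentence, simple_part = line.rstrip("\n").split("\t")
    # Replace the corpus-internal separator with a plain space so the
    # target reads as ordinary consecutive sentences.
    target = simple_part.replace(" <::::> ", " ")
    # T5 models conventionally take a task prefix on the input side.
    source = task_prefix + complex_sentence
    return source, target

raw = ("Mary likes football with her friends that are nice people.\t"
       "Mary likes football with her friends. <::::> "
       "Her friends are nice people.")
src, tgt = make_example(raw)
```

The pair returned here can be written out as (text, summary)-style columns so the seq2seq training script can consume it directly.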
Also, it could be difficult to beat the existing model: google/roberta2roberta_L-24_wikisplit · Hugging Face