Fine-Tuning a T5 model for sentence splitting in English

Sentence splitting is the task of dividing a long sentence into multiple shorter sentences. For example, the sentence:

Mary likes to play football in her free time whenever she meets with her friends that are very nice people.

could be split into

Mary likes to play football in her free time whenever she meets with her friends. Her friends are very nice people.

Currently there is only one model on the hub for sentence splitting: google/roberta2roberta_L-24_wikisplit · Hugging Face

The goal of this project is to have the best sentence splitting model for English on the hub.

Model

One can use one or more of the pretrained T5 models:

Datasets

The WikiSplit dataset can be used: WikiSplit – Google Research

Available training scripts

As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training.
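As a rough sketch of how training data could be prepared for such a script, assuming it reads JSON-lines input with configurable text/summary columns (the `text` and `summary` field names below are illustrative, not confirmed defaults):

```python
import json

# A few illustrative (complex sentence, split sentences) pairs.
# In practice these would come from the WikiSplit dataset.
pairs = [
    (
        "Mary likes to play football in her free time whenever she meets "
        "with her friends that are very nice people.",
        "Mary likes to play football in her free time whenever she meets "
        "with her friends. Her friends are very nice people.",
    ),
]

# Write a JSON-lines file a seq2seq training script can consume,
# mapping the long sentence to "text" and the split version to "summary".
with open("train.json", "w", encoding="utf-8") as f:
    for complex_sentence, split_sentences in pairs:
        row = {"text": complex_sentence, "summary": split_sentences}
        f.write(json.dumps(row) + "\n")
```

The same file format would then be passed to the training script via its train-file and column-name arguments.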

(Optional) Desired project outcome

The desired outcome is to have a sentence splitting model for the English language. This can be showcased directly on the hub or with a Streamlit or Gradio app.

(Optional) Challenges

Preprocessing the dataset can be challenging, but should be feasible. It might even be a nice side project to add the WikiSplit dataset (WikiSplit – Google Research) to the datasets library while training the model.

Also, it could be difficult to beat the existing model: google/roberta2roberta_L-24_wikisplit · Hugging Face

(Optional) Links to read upon

I'm interested in being part of this project.

This is a cool project. I am also interested!

Great, let’s finalize it :slight_smile: Two is enough!

We already have the wiki_split dataset added in the datasets library.
I think we need to modify this data to put it in a seq2seq format.
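A minimal sketch of that reformatting, assuming the wiki_split examples expose `complex_sentence`, `simple_sentence_1`, and `simple_sentence_2` fields (check the dataset card for the actual column names), with a `split:` task prefix for T5:

```python
def to_seq2seq(example):
    """Turn one wiki_split example into a T5-style input/target pair.

    Assumes the example has 'complex_sentence', 'simple_sentence_1'
    and 'simple_sentence_2' fields; adjust to the actual column names.
    """
    return {
        # Task prefix so T5 knows which task to perform.
        "input_text": "split: " + example["complex_sentence"],
        # Target: the two shorter sentences joined into one string.
        "target_text": (
            example["simple_sentence_1"].strip()
            + " "
            + example["simple_sentence_2"].strip()
        ),
    }

# With the datasets library this could then be applied as:
#   dataset = dataset.map(to_seq2seq)
```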