Fine-Tuning a Seq2Seq model for sentence splitting in English.
Sentence splitting is the task of dividing a long sentence into multiple shorter sentences. For example, the sentence:
Mary likes to play football in her free time whenever she meets with her friends that are very nice people.
could be split into
Mary likes to play football in her free time whenever she meets with her friends. Her friends are very nice people.
Currently there is only one model on the hub for sentence splitting: google/roberta2roberta_L-24_wikisplit · Hugging Face
The goal of this project is to have the best sentence splitting model for English on the hub.
Model
One can use one or more of the pretrained T5 models.
Datasets
The Wikisplit dataset can be used: WikiSplit – Google Research
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training.
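As an illustration, a training run might be launched roughly like the sketch below. All flag values, file names, and column names here are placeholders and assumptions; check the script's `--help` for the actual interface and defaults.

```shell
# Hedged sketch of launching the Flax summarization script on
# WikiSplit-style data. Paths, column names, and hyperparameters
# are illustrative placeholders, not a verified recipe.
python run_summarization_flax.py \
    --model_name_or_path t5-base \
    --train_file wikisplit_train.csv \
    --validation_file wikisplit_val.csv \
    --text_column complex_sentence \
    --summary_column simple_sentences \
    --source_prefix "split: " \
    --output_dir ./t5-wikisplit \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3
```

Note the `--source_prefix "split: "` idea: T5 checkpoints were pretrained with task prefixes, so giving the splitting task its own prefix is a common (assumed, not required) convention.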
(Optional) Desired project outcome
The desired outcome is to have a sentence splitting model for the English language. This can be showcased directly on the hub or with a streamlit or gradio app.
(Optional) Challenges
Preprocessing the dataset can be challenging, but should be feasible. It might even be a nice side project to add the WikiSplit dataset (WikiSplit – Google Research) to the datasets library while training the model.
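To make the preprocessing step concrete, a minimal sketch is shown below. It assumes the raw WikiSplit release is tab-separated, with the complex sentence in one column and the simple sentences in the other, joined by a `<::::>` marker; verify this against the actual data files before relying on it.

```python
# Minimal preprocessing sketch for a T5-style split model.
# Assumption: each raw WikiSplit line is
#   "<complex sentence>\t<simple 1> <::::> <simple 2>"
# (check the actual release format).

def make_example(line, task_prefix="split: "):
    """Turn one raw WikiSplit line into a (source, target) pair."""
    complex_sentence, simple_part = line.rstrip("\n").split("\t")
    # Replace the corpus-internal separator with a plain space so the
    # target reads as ordinary consecutive sentences.
    target = simple_part.replace(" <::::> ", " ")
    # T5 models conventionally take a task prefix on the input side.
    source = task_prefix + complex_sentence
    return source, target

raw = ("Mary likes football with her friends that are nice people.\t"
       "Mary likes football with her friends. <::::> "
       "Her friends are nice people.")
src, tgt = make_example(raw)
```

The pair returned here can be written out as (text, summary)-style columns so the seq2seq training script can consume it directly.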
Also, it could be difficult to beat the existing model: google/roberta2roberta_L-24_wikisplit · Hugging Face