Sentence splitting is the task of dividing a long sentence into multiple shorter sentences. For example, the sentence:
Mary likes to play football in her free time whenever she meets with her friends that are very nice people.

could be split into:

Mary likes to play football in her free time whenever she meets with her friends. Her friends are very nice people.
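Framed as a seq2seq problem, each example is an (input, target) pair. A minimal sketch of how such a pair could be built for a T5-style model; the `split: ` task prefix is an assumption here (T5 commonly uses such task prefixes), not a requirement of the dataset:

```python
# Sketch of framing one training pair for a T5-style seq2seq model.
# The "split: " task prefix is an assumption (a common T5 convention),
# not something the WikiSplit data prescribes.

def make_example(complex_sentence: str, simple_sentences: list) -> tuple:
    """Build an (input, target) pair for seq2seq training."""
    source = "split: " + complex_sentence
    # The target is simply the shorter sentences concatenated in order.
    target = " ".join(simple_sentences)
    return source, target

src, tgt = make_example(
    "Mary likes to play football in her free time whenever she meets "
    "with her friends that are very nice people.",
    [
        "Mary likes to play football in her free time whenever she "
        "meets with her friends.",
        "Her friends are very nice people.",
    ],
)
print(src)
print(tgt)
```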
Currently, there is only one model on the hub for sentence splitting: google/roberta2roberta_L-24_wikisplit · Hugging Face.
The goal of this project is to have the best sentence splitting model for English on the hub.
One can use one or more of the pretrained T5 models.
The WikiSplit dataset can be used: WikiSplit – Google Research
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training.
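The summarization script reads a source text column and a target column (selectable via its `--text_column` and `--summary_column` arguments), so WikiSplit records need to be mapped into that shape. A minimal sketch of that mapping; the field names `complex_sentence`, `simple_sentence_1`, and `simple_sentence_2` are assumptions about the dataset schema and should be checked against the actual columns:

```python
# Sketch of mapping one WikiSplit record into the two columns that
# run_summarization_flax.py consumes (via --text_column / --summary_column).
# The input field names below are assumptions about the dataset schema;
# adjust them to the real column names before use.

def to_seq2seq_columns(record: dict) -> dict:
    """Turn one WikiSplit record into {'text': ..., 'summary': ...}."""
    return {
        "text": record["complex_sentence"],
        "summary": record["simple_sentence_1"] + " " + record["simple_sentence_2"],
    }

example = {
    "complex_sentence": "Mary likes to play football whenever she meets "
                        "with her friends that are very nice people.",
    "simple_sentence_1": "Mary likes to play football whenever she meets "
                         "with her friends.",
    "simple_sentence_2": "Her friends are very nice people.",
}
row = to_seq2seq_columns(example)
print(row)
```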
The desired outcome is to have a sentence splitting model for the English language. This can be showcased directly on the hub or with a Streamlit or Gradio app.
Preprocessing the dataset can be challenging, but should be feasible. It might even be a nice side project to add the WikiSplit dataset (WikiSplit – Google Research) to the datasets library while training the model.
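A loading script for the datasets library would mostly come down to parsing the raw TSV release. A small sketch of parsing one line; the tab-separated layout and the `<::::>` separator between the two simple sentences reflect my understanding of the published format, so verify them against the actual files:

```python
# Sketch of parsing one line of the raw WikiSplit TSV release, as a
# starting point for a datasets loading script. The tab-separated layout
# and the "<::::>" separator between the simple sentences are assumptions
# about the published format; check them against the real files.

def parse_wikisplit_line(line: str) -> dict:
    """Split one raw TSV line into the complex sentence and its parts."""
    complex_sentence, simple_part = line.rstrip("\n").split("\t")
    simple_sentences = [s.strip() for s in simple_part.split("<::::>")]
    return {"complex": complex_sentence, "simple": simple_sentences}

line = "A long sentence , and it continues .\tA long sentence . <::::> It continues .\n"
parsed = parse_wikisplit_line(line)
print(parsed)
```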
Also, it could be difficult to beat the existing model: google/roberta2roberta_L-24_wikisplit · Hugging Face