Pre-training/fine-tuning Seq2Seq model for spelling and/or grammar correction
For this project, one can use a randomly initialized or a pre-trained BART/T5 model.
Model
Pre-trained BART and T5 models can be found on the model hub.
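As a minimal sketch of how a hub checkpoint could be loaded and queried with the transformers library (the `t5-small` checkpoint and the `"fix spelling:"` prefix are illustrative assumptions — the off-the-shelf model has not been fine-tuned for correction yet):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any seq2seq checkpoint from the model hub works here; t5-small is just a small example.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Illustrative task prefix; a fine-tuned correction model would define its own input format.
inputs = tokenizer("fix spelling: I liek to wirte.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(corrected)
```

The same `AutoModelForSeq2SeqLM` API covers both BART and T5 checkpoints, so the project can swap architectures without changing the training or inference code.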
Datasets
The dataset for this model can be prepared as described in this blog post.
One can make use of OSCAR. The dataset is also available through the datasets library: oscar · Datasets at Hugging Face.
The desired outcome is to train a spelling correction model for the English language. This can be showcased directly on the hub or with a Streamlit or Gradio app.
(Optional) Challenges
Implementing the dataset noising function would be the challenging part of the project.
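As a starting point, here is one possible sketch of such a noising function: it injects synthetic typos (dropped, duplicated, swapped, or substituted characters) into clean text, producing (noisy, clean) pairs for training. The per-character noise rate `p` and the set of operations are assumptions for illustration, not a prescribed recipe:

```python
import random

def add_noise(text, p=0.1, seed=None):
    """Randomly drop, duplicate, swap, or substitute letters to simulate typos."""
    rng = random.Random(seed)  # seeded RNG so the corruption is reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < p:
            op = rng.choice(["drop", "dup", "swap", "sub"])
            if op == "drop":
                pass  # omit the character entirely
            elif op == "dup":
                out.extend([c, c])  # double the character
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # transpose with the next character
                i += 1
            else:
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))  # substitute
        else:
            out.append(c)
        i += 1
    return "".join(out)

clean = "the quick brown fox jumps over the lazy dog"
noisy = add_noise(clean, p=0.15, seed=0)
# (noisy, clean) becomes one (input, target) training pair for the seq2seq model
print(noisy, "->", clean)
```

Mapping this function over an OSCAR text column (e.g. with `datasets.Dataset.map`) would yield the corrupted inputs, with the original sentences as targets.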
I have been working on automatic punctuation prediction, so I am keen to explore other aspects of language correction as well. This project would be great for that. Count me in!
Did you open a Discord channel, or do you have other means of communication? @naruto7 - feel free to go ahead and open a Discord channel if no one has done it yet.
Just to say that I have continued to research this topic of spelling correction (in particular, ASR error correction) and have just posted information that may be of interest to people working on similar problems.