Pre-training/fine-tuning Seq2Seq model for spelling and/or grammar correction in English

For this project, one can use a randomly initialized or a pre-trained BART/T5 model.


Pre-trained BART and T5 models can be found on the Hugging Face model hub.


The dataset for this model can be prepared as described in this blog post.
One can make use of the OSCAR corpus. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.

Available training scripts

As this will be a Seq2Seq model, the Seq2Seq example script can be used for training.
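One possible setup (not spelled out in the original post) is to repurpose the Flax summarization example script from transformers, treating the noisy sentence as the source column and the clean sentence as the target. The column names (`noisy`/`clean`), file paths, and hyperparameters below are illustrative assumptions:

```shell
# Illustrative invocation of the transformers Flax summarization example,
# repurposed for spelling correction. Paths, column names, and
# hyperparameters are assumptions, not values from the original post.
python run_summarization_flax.py \
  --model_name_or_path facebook/bart-base \
  --train_file train.json \
  --validation_file valid.json \
  --text_column noisy \
  --summary_column clean \
  --output_dir ./spelling-correction-en \
  --do_train --do_eval \
  --per_device_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 3
```

Each line of `train.json` would hold one `{"noisy": ..., "clean": ...}` pair produced by the noising step described below in the Challenges section.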

(Optional) Desired project outcome

The desired outcome is to train a spelling correction model for the English language. This can be showcased directly on the Hub or with a Streamlit or Gradio app.

(Optional) Challenges

Implementing the dataset noising function would be the challenging part of the project.
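As a rough starting point, such a noising function might apply random character-level edits (deletion, insertion, substitution, adjacent swap) to clean text; each (noisy, clean) pair then becomes an (input, target) training example. The sketch below is a minimal illustration; the choice of edit types and the `noise_prob` value are assumptions, not part of the project description:

```python
import random

def add_spelling_noise(text, noise_prob=0.1, seed=None):
    """Corrupt a clean sentence with random character-level edits to
    create a noisy input for a spelling-correction Seq2Seq model.

    Illustrative sketch: the edit types (delete, insert, substitute,
    swap) and default noise_prob are assumptions.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < noise_prob:
            op = rng.choice(["delete", "insert", "substitute", "swap"])
            if op == "delete":
                pass  # drop the character entirely
            elif op == "insert":
                out.append(c)
                out.append(rng.choice(alphabet))  # add a stray letter
            elif op == "substitute":
                out.append(rng.choice(alphabet))  # replace the letter
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose adjacent characters
                out.append(c)
                i += 1
            else:
                out.append(c)  # swap at end of string: keep as-is
        else:
            out.append(c)
        i += 1
    return "".join(out)

# The (noisy, clean) pair becomes the (input, target) training example:
clean = "the quick brown fox jumps over the lazy dog"
noisy = add_spelling_noise(clean, noise_prob=0.3, seed=0)
```

With `noise_prob=0.0` the function is the identity, which gives an easy sanity check; a fixed `seed` makes corruption reproducible across dataset builds.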

(Optional) Links to read up on


I have been working on automatic punctuation prediction, so I am keen to explore other aspects of language correction as well. This project would be great for that. Count me in!

This sounds very interesting. Count me in!

Really interesting! I'd like to join 🙂

Great, looking forward to finalizing the project 🙂


Did you guys open a discord channel? Or do you have other means of communication? @naruto7 - feel free to go ahead and open a discord channel if no one has done it yet 🙂


Hello @valhalla.
Did you publish a model and/or a notebook/script for this spelling correction model? Thanks.

We did not get a team to do this project, so sadly, no.

Hi @valhalla.

Just to say that I have continued to research this topic of spelling correction (in particular, correcting ASR errors) and have just posted information that may be of interest to people with similar interests.