Pre-training/fine-tuning Seq2Seq model for spelling and/or grammar correction
For this project, one can use a randomly initialized or a pre-trained BART/T5 model.
Pre-trained BART and T5 checkpoints can be found on the model hub.
The dataset for this model can be prepared as described in this blog post.
One can make use of OSCAR. The dataset is also available through the
datasets library here: oscar · Datasets at Hugging Face.
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training.
(Optional) Desired project outcome
The desired outcome is to train a spelling correction model for the English language. This can be showcased directly on the hub or with a Streamlit or Gradio app.
Implementing the dataset noising function would be the challenging part of the project.
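A minimal sketch of what such a noising function could look like, assuming the approach of corrupting clean OSCAR sentences to build (noisy, clean) training pairs. The specific corruption operations (drop, duplicate, swap, replace) and the error rate are illustrative choices, not something prescribed by the project:

```python
import random

def noise_text(text, char_error_rate=0.05, seed=None):
    """Corrupt a clean string with random character-level 'typos'.

    Each alphabetic character is, with probability char_error_rate,
    dropped, duplicated, swapped with its neighbor, or replaced by a
    random lowercase letter.
    """
    rng = random.Random(seed)
    chars = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isalpha() and rng.random() < char_error_rate:
            op = rng.choice(["drop", "duplicate", "swap", "replace"])
            if op == "drop":
                pass  # omit this character
            elif op == "duplicate":
                chars.extend([c, c])
            elif op == "swap" and i + 1 < len(text):
                chars.extend([text[i + 1], c])  # transpose with next char
                i += 1  # the next character was already consumed
            else:  # replace (or swap at the last position)
                chars.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        else:
            chars.append(c)
        i += 1
    return "".join(chars)

def make_pairs(sentences, **noise_kwargs):
    """Build (noisy input, clean target) pairs for Seq2Seq training."""
    return [(noise_text(s, **noise_kwargs), s) for s in sentences]
```

The resulting pairs can then be fed to the Seq2Seq model with the noisy string as the encoder input and the clean string as the decoder target. A real pipeline would likely also add word-level noise (merged or split words, common confusions) on top of this character-level scheme.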
(Optional) Links to read upon