Pre-training/fine-tuning Seq2Seq model for spelling and/or grammar correction
For this project, one can use a randomly initialized or a pre-trained BART/T5 model.
Model
Pre-trained BART and T5 models can be found on the model hub.
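As a minimal sketch of both options (the checkpoint name below is just an example; any BART or T5 checkpoint from the hub works, and a French or multilingual one may be preferable for this project):

```python
from transformers import (
    FlaxBartForConditionalGeneration,
    BartConfig,
    BartTokenizerFast,
)

# Option 1: start from a pre-trained checkpoint on the hub
# ("facebook/bart-base" is only an illustrative choice).
model = FlaxBartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")

# Option 2: randomly initialize a model with the same architecture,
# keeping the pre-trained tokenizer and config.
config = BartConfig.from_pretrained("facebook/bart-base")
model = FlaxBartForConditionalGeneration(config)
```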
Datasets
The dataset for this model can be prepared as described in this blog post.
One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.
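For instance, the deduplicated French subset can be loaded as follows (streaming avoids downloading the full corpus up front):

```python
from datasets import load_dataset

# Stream the deduplicated French portion of OSCAR.
dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_fr",
    split="train",
    streaming=True,
)

# Peek at one raw text example.
print(next(iter(dataset))["text"])
```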
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training.
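A rough example invocation is sketched below; the flag names come from the Flax summarization example (check the script's --help for the current set), and the file paths, column names, and hyperparameters are placeholders. Here the noisy text would be the source column and the clean text the target column:

```bash
python run_summarization_flax.py \
    --model_name_or_path facebook/bart-base \
    --train_file train.json \
    --validation_file valid.json \
    --text_column noisy \
    --summary_column clean \
    --output_dir ./spelling-correction-fr \
    --do_train --do_eval \
    --per_device_train_batch_size 32 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --max_source_length 128 \
    --max_target_length 128 \
    --predict_with_generate
```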
(Optional) Desired project outcome
The desired outcome is to train a spelling correction model for the French language. This can be showcased directly on the hub or with a Streamlit or Gradio app.
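A minimal Gradio sketch of such a demo, assuming a fine-tuned checkpoint has been pushed to the hub (the model id below is a placeholder):

```python
import gradio as gr
from transformers import pipeline

# "your-username/bart-spelling-fr" is a hypothetical model id.
corrector = pipeline("text2text-generation", model="your-username/bart-spelling-fr")

def correct(text):
    # Return the corrected text generated by the model.
    return corrector(text, max_length=256)[0]["generated_text"]

gr.Interface(
    fn=correct,
    inputs="text",
    outputs="text",
    title="French spelling correction",
).launch()
```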
(Optional) Challenges
Implementing the dataset noising function would likely be the most challenging part of the project.
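A minimal sketch of such a function, assuming simple character-level corruption (drop, duplicate, replace, swap) as a stand-in for real misspellings; a real project might add accent stripping or keyboard-adjacency confusion sets for French:

```python
import random

# Alphabet used for random replacements; includes common French accents.
ALPHABET = "abcdefghijklmnopqrstuvwxyzéèêàâçùôî"

def add_noise(text, p=0.1, seed=None):
    """Randomly corrupt characters to synthesize misspellings."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < p:
            op = rng.choice(["drop", "duplicate", "replace", "swap"])
            if op == "drop":
                pass  # skip this character entirely
            elif op == "duplicate":
                out.extend([chars[i], chars[i]])
            elif op == "replace":
                out.append(rng.choice(ALPHABET))
            elif op == "swap":
                if i + 1 < len(chars):
                    # transpose this character with the next one
                    out.extend([chars[i + 1], chars[i]])
                    i += 1
                else:
                    out.append(chars[i])
        else:
            out.append(chars[i])
        i += 1
    return "".join(out)

# The noisy text becomes the model input, the original the target.
clean = "Le chat dort sur le canapé."
noisy = add_noise(clean, p=0.15, seed=0)
print(noisy, "->", clean)
```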