Pre-training/fine-tuning Seq2Seq model for spelling and/or grammar correction
For this project, one can use a randomly initialized or a pre-trained BART/T5 model.
Model
Pre-trained BART and T5 models can be found on the model hub.
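As a minimal sketch of how a hub checkpoint could be loaded and queried with the transformers library (the `t5-small` checkpoint and the `"fix spelling:"` prefix are illustrative assumptions — the off-the-shelf model has not been fine-tuned for correction yet):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any seq2seq checkpoint from the model hub works here; t5-small is just a small example.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Illustrative task prefix; a fine-tuned correction model would define its own input format.
inputs = tokenizer("fix spelling: I liek to wirte.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(corrected)
```

The same `AutoModelForSeq2SeqLM` API covers both BART and T5 checkpoints, so the project can swap architectures without changing the training or inference code.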
Datasets
The dataset for this model can be prepared as described in this blog post.
One can make use of OSCAR. The dataset is also available through the datasets library: oscar · Datasets at Hugging Face.
The desired outcome is to train a spelling correction model for the English language. This can be showcased directly on the hub or with a Streamlit or Gradio app.
(Optional) Challenges
Implementing the dataset noising function would be the challenging part of the project.
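As a starting point, here is one possible sketch of such a noising function: it injects synthetic typos (dropped, duplicated, swapped, or substituted characters) into clean text, producing (noisy, clean) pairs for training. The per-character noise rate `p` and the set of operations are assumptions for illustration, not a prescribed recipe:

```python
import random

def add_noise(text, p=0.1, seed=None):
    """Randomly drop, duplicate, swap, or substitute letters to simulate typos."""
    rng = random.Random(seed)  # seeded RNG so the corruption is reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < p:
            op = rng.choice(["drop", "dup", "swap", "sub"])
            if op == "drop":
                pass  # omit the character entirely
            elif op == "dup":
                out.extend([c, c])  # double the character
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # transpose with the next character
                i += 1
            else:
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))  # substitute
        else:
            out.append(c)
        i += 1
    return "".join(out)

clean = "the quick brown fox jumps over the lazy dog"
noisy = add_noise(clean, p=0.15, seed=0)
# (noisy, clean) becomes one (input, target) training pair for the seq2seq model
print(noisy, "->", clean)
```

Mapping this function over an OSCAR text column (e.g. with `datasets.Dataset.map`) would yield the corrupted inputs, with the original sentences as targets.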
I have been working on automatic punctuation prediction, so I am keen to explore other aspects of language correction as well. This project would be great for that. Count me in!
Did you open a Discord channel, or do you have other means of communication? @naruto7 - feel free to go ahead and open a Discord channel if no one has done it yet.
Just to say that I have continued to research this topic of spelling correction (in particular, ASR error correction) and have just posted information that may be of interest to people working on similar problems.