Please read the topic category description to understand what this is all about
Description
Spellcheckers are everywhere these days: on your phone, in Microsoft Word, and so on. The goal of this project is to train a Transformer model to automatically correct spelling in a language of your choosing!
Model(s)
You can frame spellchecking as a sequence-to-sequence task, so a good starting point is to check out the machine translation example in Chapter 7 of the Course. Once you understand that, a T5 or mT5 model is a good choice for training your own spellchecker.
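To make the sequence-to-sequence framing concrete, here is a minimal sketch of how (misspelled, corrected) sentence pairs could be turned into source/target text for a text-to-text model such as T5. The prefix `fix spelling: `, the field names `source` and `target`, and the helper `to_seq2seq_examples` are all assumptions made for illustration, not something the Course or T5 prescribes.

```python
from typing import Dict, List, Tuple

# Hypothetical task prefix; any consistent string would do.
TASK_PREFIX = "fix spelling: "


def to_seq2seq_examples(
    pairs: List[Tuple[str, str]], prefix: str = TASK_PREFIX
) -> List[Dict[str, str]]:
    """Turn (noisy, clean) sentence pairs into source/target records."""
    examples = []
    for noisy, clean in pairs:
        noisy, clean = noisy.strip(), clean.strip()
        # Skip pairs where either side is empty after stripping whitespace.
        if not noisy or not clean:
            continue
        examples.append({"source": prefix + noisy, "target": clean})
    return examples


pairs = [("I hvae a dreem", "I have a dream"), ("   ", "skipped")]
examples = to_seq2seq_examples(pairs)
print(examples)
```

Records in this shape can then be tokenized the same way as the translation pairs in the Course example, with the misspelled sentence playing the role of the source language and the corrected sentence the target.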
Datasets
The GitHub Typo Corpus is a good place to start. An alternative is to use back-translation to create your own corpus of noisy labels, since machine translation systems typically introduce small errors this way.
Challenges
This is a rather open-ended project, and one that might require some careful data preprocessing / augmentation. A good starting strategy would be to adapt the example given in the resources below, but using the Hugging Face ecosystem instead of the fairseq library.
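As one possible form of augmentation, the sketch below injects random character-level typos into clean text to produce (noisy, clean) training pairs. Note that this is a simple rule-based substitute, not the back-translation approach mentioned under Datasets, and the function name `add_typos`, the three edit operations, and the default `rate` are assumptions chosen for illustration.

```python
import random


def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Corrupt letters with probability `rate` (delete, duplicate, or swap)."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend([c, c])
                i += 1
                continue
            # Swap only with a following letter, so spaces and
            # punctuation are never moved or removed.
            if i + 1 < len(chars) and chars[i + 1].isalpha():
                out.extend([chars[i + 1], c])
                i += 2
                continue
        out.append(c)
        i += 1
    return "".join(out)


clean = "The quick brown fox jumps over the lazy dog."
noisy = add_typos(clean, rate=0.3, seed=1)
pair = (noisy, clean)  # (model input, expected output)
print(pair)
```

Real typos follow keyboard layout and language-specific patterns, so pairs generated like this are probably best mixed with naturally occurring errors rather than used on their own.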
Desired project outcomes
- Create a Streamlit or Gradio app on Spaces that [Fill description]
- Don't forget to push all your models and datasets to the Hub so others can build on them!