Pre-train a Seq2Seq model for a quick Vietnamese input method by mapping ASCII syllables that are missing marks and tones to UTF-8 syllables. E.g. toi noi tieng Viet => tôi nói tiếng Việt

Problem: The most popular Vietnamese input method is Telex, which requires additional keypresses to create the marks and tones of Vietnamese syllables. E.g. tooi => tôi, nois => nói, tieengs => tiếng, Vieetj => Việt. It works just fine if you are using a laptop or desktop with a physical QWERTY keyboard (typing with 8 fingers). But it is slow and error-prone on smartphones (virtual keyboard, typing with 2 fingers) or feature phones (using T9 with only the 0-9 keys). By using ML to map the minimal ASCII version of syllables to UTF-8 syllables with full marks and tones, I hope we can improve the situation. Besides pre-training the model on big data from a pre-defined text corpus, on-the-fly learning / adapting from the document the user is typing should also help improve accuracy, since we tend to repeat the same terms, keywords or phrases …
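
To make the on-the-fly adaptation idea concrete, here is a minimal sketch in Python. It is not the Seq2Seq model, just a simple frequency-table baseline I am assuming for illustration: it remembers, for each ASCII syllable, which accented form it has seen most often, and it can keep learning from whatever the user types. The names `strip_marks` and `FrequencyRestorer` are hypothetical. Lowercasing and whitespace splitting are simplifications.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_marks(text: str) -> str:
    """Remove Vietnamese marks and tones, leaving plain ASCII letters."""
    # đ/Đ have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


class FrequencyRestorer:
    """Hypothetical baseline: pick the most frequently seen accented form."""

    def __init__(self):
        # ASCII syllable -> counts of the accented syllables seen for it.
        self.counts = defaultdict(Counter)

    def learn(self, text: str) -> None:
        """Update counts from accented text (corpus or the user's document)."""
        for syllable in unicodedata.normalize("NFC", text).lower().split():
            self.counts[strip_marks(syllable)][syllable] += 1

    def restore(self, ascii_text: str) -> str:
        """Replace each ASCII syllable with its most common accented form."""
        restored = []
        for syllable in ascii_text.lower().split():
            candidates = self.counts.get(syllable)
            # Unknown syllables are passed through unchanged.
            restored.append(candidates.most_common(1)[0][0] if candidates else syllable)
        return " ".join(restored)


restorer = FrequencyRestorer()
restorer.learn("tôi nói tiếng Việt")   # pre-training text
restorer.learn("tôi nói")              # text the user typed later
print(restorer.restore("toi noi tieng viet"))
```

A table like this ignores context, so it cannot tell apart syllables that share the same ASCII form. That is the gap I hope a sequence model can close.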

Model: I’m quite new to the field, so I have no idea which model is best for Vietnamese in general. Please discuss.

Data: binhvq/news-corpus on GitHub ("Corpus tiếng việt", i.e. a Vietnamese corpus),
around 18.6 GB of internet news articles crawled from 130 Vietnamese online news websites.
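
Since the corpus already contains fully accented text, training pairs can be generated automatically by stripping marks and tones from each sentence. Below is a rough Python sketch of that step. The function names `strip_marks` and `make_pair` are my own, and I am assuming the corpus can be read as plain UTF-8 sentences (I have not checked its actual file layout).

```python
import unicodedata
from typing import Tuple


def strip_marks(text: str) -> str:
    """Remove Vietnamese marks and tones, leaving plain ASCII letters."""
    # đ/Đ have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def make_pair(sentence: str) -> Tuple[str, str]:
    """Return (model input, model target) for one accented sentence."""
    # Normalize to NFC so every target uses one consistent encoding.
    target = unicodedata.normalize("NFC", sentence)
    return strip_marks(target), target


source, target = make_pair("tôi nói tiếng Việt")
print(source, "=>", target)
```

The explicit handling of đ matters: Unicode normalization alone leaves it untouched, so it would otherwise leak into the ASCII side of the pair.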

Method: As you can see from the example in the project title, the ASCII input is the correct text with its marks and tones deleted, so we can treat the problem as a sub-problem of spelling correction (deletion only). I guess we can follow the project proposal below as a foundation for the final solution.