At my job I recently worked on text cleaning for Tamasheq, a language group spoken in present-day Mali, Niger, and Algeria. The text was part of a project to build ASR for this language.
We had the resources to assemble a data labeling (transcription) team.
Since none of the labelers use the language in writing, there was a lot of noise: the labelers did not agree on spelling. The variance was interesting from a phonological and morphological perspective.
Some of the text data went through as many as two editing passes (three passes total). That is, the same text + audio was listened to and "cleaned" by the best (most consistent) transcriptionists.
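To give a concrete sense of what I mean by comparing passes: a minimal sketch of measuring token-level agreement between two transcription passes of the same audio. The alignment here is naive (positional), and the example strings are invented, not real Tamasheq transcriptions.

```python
def token_agreement(pass_a: str, pass_b: str) -> float:
    """Fraction of positionally aligned tokens that match between two
    transcription passes. A real pipeline would align with edit distance
    instead of zip(), since insertions/deletions shift everything."""
    a, b = pass_a.split(), pass_b.split()
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

# Invented example: two passes disagree only on the first token's spelling.
print(token_agreement("azzaman n tidet", "azaman n tidet"))  # ≈ 0.67
```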
I studied the multipass data and even implemented some phonologically based text normalization. However, this approach was fun but often tedious and not very fruitful.
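For context, by "phonologically based text normalization" I mean something along these lines: a small set of hand-written substitution rules that collapse spelling variants toward a canonical form. The specific rules and tokens below are invented for illustration, not actual Tamasheq orthography.

```python
import re

# Hypothetical rules mapping variant spellings to a canonical form,
# e.g. digraphs to single phoneme symbols, collapsing vowel-length marks.
RULES = [
    (r"gh", "ɣ"),
    (r"kh", "x"),
    (r"aa+", "a"),
]

def normalize(token: str) -> str:
    """Lowercase a token, then apply each substitution rule in order."""
    token = token.lower()
    for pattern, replacement in RULES:
        token = re.sub(pattern, replacement, token)
    return token

print(normalize("Taghlamt"))  # taɣlamt
print(normalize("takhlamt"))  # taxlamt
```

Writing and ordering such rules by hand is exactly the tedious part, which is why I am hoping for something more data-driven.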
What language-agnostic or statistical approaches exist for text normalization? I am new to NLP (I am a linguist by training) and could use some clues as to where to start reading.
Thanks in advance for any pointers and advice.