Text normalization for low-resource languages

DenizciMoe · March 3, 2023, 7:06pm

So at my job I recently worked on text cleaning for Tamasheq, a language group spoken in what is today Mali, Niger and Algeria. The text was for building ASR for this language.

We had the resources to create a data labeling team (transcription).
As none of the labelers use the language in writing, there was a lot of noise: the labelers did not agree on spelling. The variance was interesting from a phonological and morphological perspective.

Some of the text data saw as many as 2 editing passes (3 passes total, counting the initial transcription). This means that the same text data + audio were listened to again and “cleaned” by the best (most consistent) transcriptionists.
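
To make the multipass setup concrete, here is a minimal sketch of how spelling variants could be counted between an early pass and a later pass of the same utterance. It is an illustration only, not the pipeline I actually used: the function name `variant_pairs` and the example tokens are made up (they are not Tamasheq), and it assumes both passes are whitespace-tokenized.

```python
from collections import Counter
from difflib import SequenceMatcher


def variant_pairs(first_pass: str, final_pass: str) -> Counter:
    """Count (early spelling, later spelling) token pairs between two passes."""
    a, b = first_pass.split(), final_pass.split()
    pairs = Counter()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only one-to-one token replacements; insertions, deletions and
        # uneven replacements are too ambiguous to treat as spelling variants.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(a[i1:i2], b[j1:j2]))
    return pairs


# Placeholder tokens, not real transcription data.
counts = variant_pairs("aa bb cc", "aa bx cc")
print(counts.most_common())
```

Summed over a whole corpus, the most frequent pairs would give a rough list of the spellings the consistent transcriptionists tend to correct.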

I studied the multipass data and even implemented some phonologically based text normalization. However, this approach was fun but often tedious and not very fruitful.
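
For anyone unfamiliar with what I mean by rule-based normalization, the general shape is an ordered list of rewrite rules applied to each transcript. The sketch below is hypothetical: the rules in `RULES` are placeholders chosen for illustration and are not the actual Tamasheq rules I wrote.

```python
import re
import unicodedata

# Hypothetical placeholder rules, NOT real Tamasheq orthography.
# Each rule is (compiled pattern, replacement) and rules are applied in order.
RULES = [
    (re.compile(r"(.)\1{2,}"), r"\1\1"),  # cap runs of a repeated character at two
    (re.compile(r"ou"), "u"),             # made-up digraph-to-grapheme mapping
]


def normalize(text: str) -> str:
    """Apply Unicode NFC, lowercasing, the rewrite rules, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize("  Baaad   OUt "))
```

Every rule like this has to be written and checked by hand, which is where the tedium comes from.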

What language-agnostic or statistical approaches exist for normalizing text? I am new to NLP (I am a linguist) and could use some clues as to where to start reading.

Thanks in advance for any pointers and advice.
