Generating Synthetic Data for Machine Translation of Dialects

Hi all. I am currently working on a project that uses natural language processing to effectively and efficiently translate dialects of a given language. Here is the abstract of my project:

Current systems are not publicly available for translations of specific dialects… Demand is growing to bridge the gap between dialects and the standard form of a given language. This problem is especially relevant to immigrant and refugee populations who speak dialects of their native language. The dialects these communities speak are typically considered “low resource” languages and therefore do not have many publicly available resources to help people in these communities learn English. Effective dialect translation will help people around the world communicate and interact with each other.

One of the main issues in translating dialects is the lack of training data available to train models for these dialects. This is a prerequisite issue to address before creating machine learning models to translate dialects. I aim to tackle this problem by first synthetically creating an English, main-language, and dialect dataset. I will use the multilingual NMT approach to handle storing information about different dialects. This kind of approach takes into account that dialects of a language are similar to each other, so the model can capture the similarities between the dialects. Then, I will use a text generation API to generate English/dialect pairs to form a dataset. Once I have created a dataset in the target dialect, I will use a state-of-the-art translation model and train it on the synthetically created data. After training the model, my eventual goal is to publish it in an app that can translate dialects of languages.
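To make the pair-generation step concrete, here is a minimal toy sketch of the dataset format I have in mind. The `DIALECT_RULES` substitution table is purely hypothetical and stands in for the text generation API, which would of course produce far richer and more realistic variation:

```python
# Toy sketch: build English/dialect sentence pairs for a synthetic
# parallel corpus. A hypothetical substitution table stands in for the
# text generation API that would actually produce the dialect side.

# Hypothetical rules mapping standard-form phrases to dialect variants.
DIALECT_RULES = {
    "hello": "howdy",
    "you all": "y'all",
}

def to_dialect(sentence: str) -> str:
    """Apply the toy substitution rules to a standard-form sentence."""
    out = sentence.lower()
    for standard, variant in DIALECT_RULES.items():
        out = out.replace(standard, variant)
    return out

def make_pairs(english_sentences):
    """Return (english, dialect) tuples, the format most NMT toolkits expect."""
    return [(s, to_dialect(s)) for s in english_sentences]

pairs = make_pairs([
    "Hello, are you all coming to the market?",
    "You all should come visit soon.",
])
for en, dia in pairs:
    print(f"{en}\t{dia}")
```

The tab-separated output mirrors the usual parallel-corpus file layout, so the same pair structure carries over directly once a real generation API replaces the rule table.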

I was wondering if you all had advice on how I could create synthetically generated datasets, perhaps using text generation software or something similar. Any feedback would be appreciated.

Thank you!

Hi Sparkles,
I’m going to work on a project similar to yours: creating a translation model from a local dialect to the main language. Did you make any further progress on your work? Do you have any advice?

Thank you in advance
Federico

Hi all!

I’m also interested in this. So far I’ve gathered a parallel corpus of about 18k sentence pairs for a low-resource language, and I still find it difficult to get decent results. I gathered that data from a few websites that have translations, some Facebook groups, government documents that teach the language, and some from the Bible corpus.

What I will try next is to create a few more phrases with my limited knowledge of the language, but I still don’t expect much better results.

I might also try the model mentioned in this paper: SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

If anyone has any suggestions I’d also be glad to hear them.

Thanks!