Generating Synthetic Data for Machine Translation of Dialects

sparkles · December 13, 2021, 2:03am

Hi all. I am currently working on a project that uses natural language processing to translate dialects of a given language effectively and efficiently. Here is the rest of my project abstract:

Current systems are not publicly available for translations of specific dialects… Demand is growing to bridge the gap between dialects and the standard form of a given language. This problem is especially relevant to immigrant and refugee populations who speak dialects of their native language. The dialects these communities speak are typically considered “low-resource” languages, so few publicly available resources exist to help people in these communities learn English. Effectively translating dialects will help people around the world communicate and interact with each other.

One of the main issues in translating dialects is the lack of data available to train models for them. This has to be addressed before any machine learning model for dialect translation can be built. I aim to tackle the problem by first synthetically creating a dataset covering English, the main language, and the dialect. I will use a multilingual NMT approach to represent the different dialects in a single model. This approach takes advantage of the fact that dialects of a language are similar to each other, so the model can share what it learns across them. Then, I will use a text generation API to generate English/dialect pairs to form a dataset. Once I have a dataset in the target dialect, I will train a state-of-the-art translation model on the synthetic data. After training, my eventual goal is to publish the model in an app that can translate dialects of languages.
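For readers unfamiliar with multilingual NMT, one common way to let a single model handle several related varieties is to prepend a target-variety token to each source sentence. The sketch below illustrates that idea only; it is not necessarily the setup intended above. The `Pair` class, the `<2...>` tag format, the variety codes `std` and `dia_a`, and the placeholder target strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pair:
    source: str      # source sentence, e.g. English
    target: str      # target sentence in the standard language or a dialect
    target_tag: str  # hypothetical variety code, e.g. "std" or "dia_a"


def tag_source(pair: Pair) -> Tuple[str, str]:
    """Prepend a target-variety token so one model can serve several varieties."""
    return f"<2{pair.target_tag}> {pair.source}", pair.target


def build_training_examples(pairs: List[Pair]) -> List[Tuple[str, str]]:
    """Turn raw pairs into (tagged source, target) training examples."""
    return [tag_source(p) for p in pairs]


# Placeholder targets stand in for real standard-language and dialect text.
pairs = [
    Pair("Good morning", "STANDARD_TEXT_1", "std"),
    Pair("Good morning", "DIALECT_TEXT_1", "dia_a"),
]
examples = build_training_examples(pairs)
```

Because the same source sentence appears with different tags, the model sees the standard form and the dialect side by side, which is what lets it share parameters across similar varieties. The tag tokens would also need to be added to the tokenizer vocabulary of whatever model is trained.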

I was wondering if you all had advice on how I could create synthetically generated datasets, perhaps using text generation software or something similar. Any feedback would be appreciated.

Thank you!

Fede231 · September 18, 2023, 3:29pm

Hi Sparkles,
I’m going to work on a project similar to yours: creating a translation model from a local dialect to the main language. Did you get any further with your work? Do you have any advice?

Thank you in advance
Federico

alfsnd · October 2, 2024, 6:04pm

Hi all!

I’m also interested in this. So far I’ve gathered a parallel corpus of about 18k pairs for a low-resource language and still find it difficult to get decent results. I gathered the data from a few websites that have translations, some Facebook groups, government documents that teach the language, and some from the Bible corpus.
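When a corpus is pulled together from several sources like this, a light cleaning pass is a common first step before training. The sketch below is only one possible version, not a description of what was done here: it normalizes Unicode and whitespace, drops empty or overly long pairs, filters pairs whose token-length ratio looks implausible, and removes case-insensitive duplicates. The function names and the thresholds `max_ratio=3.0` and `max_tokens=200` are assumptions and would need tuning for the language pair.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,   # assumed threshold for source/target length mismatch
    max_tokens: int = 200,    # assumed cap on whitespace-token length
) -> List[Tuple[str, str]]:
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # drop pairs with an empty side
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) > max_tokens:
            continue  # drop very long segments
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # drop pairs that are likely misaligned
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # drop case-insensitive duplicates
        seen.add(key)
        kept.append((src, tgt))
    return kept


# Placeholder strings stand in for real source and target sentences.
raw = [
    ("Hello  world", "SRC A B"),
    ("hello world", "src a b"),      # duplicate after lowercasing
    ("", "x"),                       # empty source
    ("one", "a b c d e f"),          # length ratio of 6
]
cleaned = clean_pairs(raw)
```

Whitespace tokenization is a rough proxy for length; for languages without spaces between words, a character-based ratio would be a more sensible choice.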

What I will try next is to write a few more phrases myself with my limited knowledge of the language, but I don’t expect that to improve the results much.

I might also try the model mentioned in this paper: SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

If anyone has any suggestions I’d also be glad to hear them.

Thanks!
