Hi all. I am currently working on a project that uses natural language processing to translate dialects of a given language effectively and efficiently. Here is the abstract of my project:
Few publicly available systems translate specific dialects, and demand is growing to bridge the gap between dialects and the standard form of a given language. This problem is especially relevant to immigrant and refugee populations who speak dialects of their native language. These dialects are typically considered "low-resource" languages, so there are few publicly available resources to help people in these communities learn English. Effective dialect translation would help people around the world communicate and interact with each other.
One of the main issues in translating dialects is the lack of training data available for these dialects; this has to be addressed before machine learning models can be built to translate them. I aim to tackle this problem by first synthetically creating a dataset covering English, the standard language, and the dialect. I plan to use a multilingual NMT approach to represent information about different dialects: because dialects of a language are similar to each other, a single multilingual model can share and exploit those similarities. To build the dataset, I will use a text generation API to generate English/dialect sentence pairs (a rough sketch of this step is below). Once I have a dataset in the target dialect, I will fine-tune a state-of-the-art translation model on the synthetically created data. After training, my eventual goal is to publish the model in an app that can translate dialects of languages.
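To make the synthetic data step concrete, here is a minimal sketch of what I have in mind, assuming a locally hosted text-generation model from the Hugging Face transformers library stands in for the text generation API. The model name and the prompt template are only placeholders; in practice a stronger instruction-tuned model or a hosted generation API would probably be needed to get usable dialect rewrites.

```python
# Hypothetical sketch: generate synthetic English/dialect sentence pairs with a
# text-generation model and save them as a small parallel dataset (JSONL).
# The model name and prompt wording are placeholders, not a final choice.

import json
from transformers import pipeline

# Placeholder model; an instruction-tuned model or hosted API would be swapped in here.
generator = pipeline("text-generation", model="gpt2")

# A few English seed sentences (in practice these would come from a larger corpus).
standard_sentences = [
    "Where is the nearest hospital?",
    "How much does this cost?",
]

pairs = []
for sentence in standard_sentences:
    # Ask the model to rewrite the English sentence in the target dialect.
    prompt = f"Rewrite the following sentence in the target dialect: {sentence}\nDialect version:"
    output = generator(prompt, max_new_tokens=40, num_return_sequences=1)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    dialect_version = output[0]["generated_text"][len(prompt):].strip()
    pairs.append({"source": sentence, "target": dialect_version})

# Save as JSONL so the pairs can later be loaded to fine-tune a translation model.
with open("synthetic_dialect_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Each line of the resulting JSONL file would be one English/dialect pair, which could then feed into fine-tuning the translation model mentioned above.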
I was wondering if you all had advice on how I could create synthetically generated datasets, perhaps using text generation software or something like that. Any feedback would be appreciated.
Thank you!