I’m working on a project that involves converting Romanized text (text written in the Latin alphabet) into the corresponding native script. Specifically, I’m dealing with Romanized Hindi (e.g., “aap kaise hain?”) and need to convert it into native Devanagari script (“आप कैसे हैं?”).
Both mBART and mT5 are pretrained on Hindi among many other languages, so they seem like good candidates for this task, but I’m unsure of the best way to fine-tune them for transliteration.
Here’s what I’m working with:
- Data: I have a parallel dataset where each Romanized sentence is paired with its corresponding native-script sentence (a couple of example rows are shown after this list).
- Models: I’m considering mBART and mT5, as they both support Hindi, which is crucial for my task.
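For concreteness, here’s roughly what the data looks like once I load it into a Hugging Face `Dataset`. The column names `roman` and `devanagari` are just my own convention:

```python
from datasets import Dataset

# Two illustrative rows; the real dataset has the same structure.
pairs = [
    {"roman": "aap kaise hain?", "devanagari": "आप कैसे हैं?"},
    {"roman": "main theek hoon", "devanagari": "मैं ठीक हूँ"},
]

dataset = Dataset.from_list(pairs)
print(dataset[0])  # {'roman': 'aap kaise hain?', 'devanagari': 'आप कैसे हैं?'}
```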
I need help with the following:
- Data Preparation:
  - How should I format my dataset for fine-tuning? Should I treat this as a translation task where the input is Romanized text and the output is the native script? (I’ve sketched what I’m currently trying right after this list.)
- Model Fine-Tuning:
  - What steps should I follow to fine-tune mBART or mT5 for this specific transliteration task? (My current training setup is the second sketch below.)
  - Are there specific hyperparameters or training settings that work well for transliteration?
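Here’s the preprocessing I’m currently trying, framing transliteration as a translation-style seq2seq task. The `google/mt5-small` checkpoint and `max_length=128` are placeholders I picked arbitrarily, and it assumes the `dataset` from the earlier snippet:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; I'd swap in an mBART checkpoint if that
# turns out to be the better fit.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# For mBART-50 I believe I'd also need to set tokenizer.src_lang and
# tokenizer.tgt_lang (presumably "hi_IN" for the Devanagari side?); I'm
# unsure what to use for the Romanized side, which is part of my question.

def preprocess(batch):
    # Romanized text as the source, Devanagari as the target,
    # exactly like a translation pair.
    model_inputs = tokenizer(batch["roman"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["devanagari"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```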
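And here’s the rough training setup I have so far, using `Seq2SeqTrainer`. The hyperparameters are generic seq2seq starting points, not values I’ve validated for transliteration:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Learning rate, batch size, and epoch count are guesses; this is
# exactly what I'm hoping to get advice on.
args = Seq2SeqTrainingArguments(
    output_dir="mt5-hi-transliteration",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```

At inference time I’d tokenize the Romanized input, call `model.generate`, and decode with `tokenizer.batch_decode`, but I’d appreciate a sanity check on the whole setup.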