I’m working on a project that involves converting Romanized text (text written in the Latin alphabet) into the corresponding native script. Specifically, I’m dealing with Romanized Hindi (e.g., “aap kaise hain?”) and need to convert it into native Devanagari script (“आप कैसे हैं?”).
Both mBART and mT5 are pretrained on Hindi among many other languages, so they seem like good candidates for this task, but I’m unsure of the best way to fine-tune them for transliteration.
Here’s what I’m working with:
- Data: I have a parallel dataset where each Romanized sentence is paired with its corresponding native-script sentence (a couple of example rows are shown after this list).
- Models: I’m considering mBART and mT5, as they both support Hindi, which is crucial for my task.
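For concreteness, here’s roughly what the data looks like once I load it into a Hugging Face `Dataset`. The column names `roman` and `devanagari` are just my own convention:

```python
from datasets import Dataset

# Two illustrative rows; the real dataset has the same structure.
pairs = [
    {"roman": "aap kaise hain?", "devanagari": "आप कैसे हैं?"},
    {"roman": "main theek hoon", "devanagari": "मैं ठीक हूँ"},
]

dataset = Dataset.from_list(pairs)
print(dataset[0])  # {'roman': 'aap kaise hain?', 'devanagari': 'आप कैसे हैं?'}
```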
I need help with the following:
- Data Preparation:
  - How should I format my dataset for fine-tuning? Should I treat this as a translation task where the input is Romanized text and the output is the native script? (I’ve sketched what I’m currently trying right after this list.)
- Model Fine-Tuning:
  - What steps should I follow to fine-tune mBART or mT5 for this specific transliteration task? (My current training setup is the second sketch below.)
  - Are there specific hyperparameters or training settings that work well for transliteration?
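Here’s the preprocessing I’m currently trying, framing transliteration as a translation-style seq2seq task. The `google/mt5-small` checkpoint and `max_length=128` are placeholders I picked arbitrarily, and it assumes the `dataset` from the earlier snippet:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; I'd swap in an mBART checkpoint if that
# turns out to be the better fit.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# For mBART-50 I believe I'd also need to set tokenizer.src_lang and
# tokenizer.tgt_lang (presumably "hi_IN" for the Devanagari side?); I'm
# unsure what to use for the Romanized side, which is part of my question.

def preprocess(batch):
    # Romanized text as the source, Devanagari as the target,
    # exactly like a translation pair.
    model_inputs = tokenizer(batch["roman"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["devanagari"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```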
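And here’s the rough training setup I have so far, using `Seq2SeqTrainer`. The hyperparameters are generic seq2seq starting points, not values I’ve validated for transliteration:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Learning rate, batch size, and epoch count are guesses; this is
# exactly what I'm hoping to get advice on.
args = Seq2SeqTrainingArguments(
    output_dir="mt5-hi-transliteration",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```

At inference time I’d tokenize the Romanized input, call `model.generate`, and decode with `tokenizer.batch_decode`, but I’d appreciate a sanity check on the whole setup.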