For multilingual models, how do you balance performance across low-resource vs high-resource languages?

When you train an AI model that supports many languages, how do you make sure it works well for both popular and less common languages?


It depends on the model and the use case; it's hard to say which implementation would work best for you.

Based on my past experience, the data would probably need to contain a bit of every available language in each training mix; otherwise the model is likely to 'forget' what it learned earlier (catastrophic forgetting).
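One common way to build such a mix is temperature-based sampling (used, e.g., in XLM-R pretraining): raise each language's corpus share to a power alpha < 1, so low-resource languages get upsampled without drowning out high-resource ones. Here's a minimal sketch; the corpus sizes and alpha value are illustrative assumptions, not anyone's actual setup:

```python
import random

# Hypothetical corpus sizes (number of examples per language) -- illustrative only.
corpus_sizes = {"en": 1_000_000, "fr": 200_000, "sw": 5_000, "yo": 1_000}

def sampling_probs(sizes, alpha=0.3):
    """Temperature-based sampling: raising each language's raw share
    to alpha < 1 flattens the distribution, boosting low-resource
    languages relative to their raw proportion."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

probs = sampling_probs(corpus_sizes)

# Draw the language for each example in a batch, so every batch
# tends to contain a mixture of languages rather than just one.
langs = list(probs)
batch_langs = random.choices(langs, weights=[probs[l] for l in langs], k=32)
```

Lower alpha means a flatter mix (more upsampling of rare languages); alpha = 1 recovers raw proportions.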


During the fine-tuning phase, I think it depends on the actual weights of the model being trained. But during the pretraining phase, using more English (or whatever the model's main language is) might result in a smarter model…?