To be in the club, to be in the model hub

I’m writing to ask what it means that some of the most advanced language models are available on Hugging Face’s model hub. And I’m writing to ask what it means if your language does not appear anywhere in the model hub.

What does it mean that I can download Google’s BERT, OpenAI’s GPT-2 or Facebook’s M2M-100 and run them on my laptop? What does it mean that Hugging Face has made these models so easy to use that I can batch-translate a whole document in just a few lines of code?
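
To make that concrete, here is a minimal sketch of what those few lines can look like, using the facebook/m2m100_418M checkpoint to translate an English document into French. The file name and language codes are placeholders, not part of any particular workflow.

```python
# A minimal sketch: batch-translate a plain-text document with M2M-100.
# "document.txt", "en" and "fr" are placeholders; swap in your own file and codes.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "en"

with open("document.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

encoded = tokenizer(lines, return_tensors="pt", padding=True, truncation=True)
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print("\n".join(tokenizer.batch_decode(generated, skip_special_tokens=True)))
```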

For the millions of underserved people who speak one of the top 100 languages, this is wonderful! We can translate books, translate Wikipedia and create an endless stream of educational resources.

Meanwhile, the millions of people who don’t speak one of those 100 languages won’t get any new textbooks. For educational materials, they’ll remain dependent on a foreign language – usually the language of their colonizer – until they start translating.

But they can start translating! A few dedicated people could translate enough sentence pairs to train a basic machine translator. Then with back-translation, multilingual translation and other tricks, they could get it up to a respectable quality.
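
As a sketch of the back-translation step: take monolingual text in the target language, run it through a reverse (target-to-source) model, and pair the synthetic source sentences with the real target sentences. The checkpoint and file names below are placeholders for whatever reverse model and monolingual corpus you actually have.

```python
# A back-translation sketch. "Helsinki-NLP/opus-mt-fr-en" stands in for whatever
# reverse (target -> source) model exists for your pair; "monolingual_target.txt"
# is a hypothetical file with one target-language sentence per line.
from transformers import pipeline

reverse = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

with open("monolingual_target.txt", encoding="utf-8") as f:
    target_sentences = [line.strip() for line in f if line.strip()]

# Translate the target-language text back into the source language; the real
# target side stays untouched, giving synthetic (source, target) training pairs.
synthetic_sources = [out["translation_text"]
                     for out in reverse(target_sentences, batch_size=16)]
synthetic_pairs = list(zip(synthetic_sources, target_sentences))

# These pairs are then mixed with the small human-translated set when training
# the forward (source -> target) model.
```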

Will they start translating? Or will those languages wither away because they’re not in today’s top 100? What’s the significance of your language being included in these models? What does it mean if your language is not included?

Hi Eryk,

Indeed, there’s a “winner takes all” phenomenon for languages, and building models that only support the most well-resourced languages contributes to that trend. But so does our use of English on this forum!

On a more positive note, the 1000+ models released by Helsinki-NLP using OPUS include some pretty low-resource languages.
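
For example (a sketch – check the hub for the exact OPUS-MT checkpoint covering your language pair; the English–Swahili name below is just an illustration):

```python
from transformers import pipeline

# Hypothetical example pair: checkpoint names follow the
# Helsinki-NLP/opus-mt-<src>-<tgt> pattern used across the OPUS-MT models.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-sw")
print(translator("Every child deserves a textbook.")[0]["translation_text"])
```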

Another idea would be to fine-tune a multilingual model such as multilingual BERT to translate between a low-resource language and English, making use of the low-resource language’s similarity to other top-100 languages. For example, the Tetun language (which I work with) is low-resource, but has similarities with both Portuguese and Indonesian, which are covered by multilingual BERT. Is that something that you’ve considered?
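
For concreteness, here’s a rough sketch of what that kind of fine-tuning might look like. It swaps BERT for an encoder-decoder checkpoint (facebook/m2m100_418M), since translation needs a decoder, and treats Portuguese (“pt”) as a stand-in language code for Tetun to exploit exactly that similarity. The pairs.tsv file and all hyperparameters are hypothetical placeholders, not a tested recipe.

```python
# A rough fine-tuning sketch, not a tested recipe. It uses facebook/m2m100_418M
# instead of BERT (translation needs an encoder-decoder), uses Portuguese ("pt")
# as a stand-in language code for Tetun, and assumes a hypothetical "pairs.tsv"
# file with one "tetun<TAB>english" pair per line.
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name, src_lang="pt", tgt_lang="en")
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Read the (hypothetical) parallel data.
tet_texts, eng_texts = [], []
with open("pairs.tsv", encoding="utf-8") as f:
    for line in f:
        tet, eng = line.rstrip("\n").split("\t")
        tet_texts.append(tet)
        eng_texts.append(eng)
dataset = Dataset.from_dict({"tet": tet_texts, "eng": eng_texts})

def preprocess(batch):
    # Tokenize Tetun as the source and English as the target ("labels").
    return tokenizer(batch["tet"], text_target=batch["eng"],
                     truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["tet", "eng"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="m2m100-tet-en",
        per_device_train_batch_size=8,
        learning_rate=5e-5,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```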