To be in the club, to be in the model hub

ErykWdowiak · March 21, 2021, 7:45pm

I’m writing to ask what it means that some of the most advanced language models are available on Hugging Face’s model hub. And I’m writing to ask what it means if your language does not appear anywhere in the model hub.

What does it mean that I can download Google’s BERT, OpenAI’s GPT-2 or Facebook’s M2M-100 and run them on my laptop? What does it mean that Hugging Face has made these models so easy to use that I can batch translate a whole document in just a few lines of code?
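To make "a few lines of code" concrete, here is a rough sketch of batch translation with M2M-100 through the `transformers` library. The checkpoint name `facebook/m2m100_418M`, the batch size, and the helper names `batched` and `translate_document` are my own choices for illustration, not anything prescribed by the library, and the first call downloads the model weights.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_document(
    lines: List[str],
    src_lang: str,
    tgt_lang: str,
    model_name: str = "facebook/m2m100_418M",  # assumed checkpoint
    batch_size: int = 8,                       # assumed; tune for your hardware
) -> List[str]:
    """Translate `lines` from `src_lang` to `tgt_lang`, one batch at a time."""
    # Imported here so the helper above works without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer.src_lang = src_lang

    translated: List[str] = []
    for batch in batched(lines, batch_size):
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # M2M-100 selects the output language via the first generated token.
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
        )
        translated.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translated
```

Something like `translate_document(open("book.txt").read().splitlines(), "en", "it")` would then translate a file line by line, where `book.txt` is a placeholder file name. This only works for language codes the model was trained on, which is exactly the point of this post.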

For the millions of underserved people who speak one of the top 100 languages, this is wonderful! We can translate books, translate Wikipedia and create an endless stream of educational resources.

Meanwhile the millions of people who don’t speak one of those 100 languages won’t get any new textbooks. For educational materials, they’ll remain dependent on a foreign language – usually the language of their colonizer – until they start translating.

But they can start translating! A few dedicated people could translate enough sentence pairs to train a basic machine translator. Then with back-translation, multilingual translation and other tricks, they could get it up to a respectable quality.
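For readers who have not met back-translation: the idea is to take monolingual text in the target language, translate it "backwards" with a reverse model, and use the resulting synthetic pairs as extra training data. A minimal sketch, assuming you already have some reverse translator to plug in; `back_translate` and `reverse_translate` are hypothetical names, and the toy translator in the usage lines is only a stand-in for a real model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    mono_target: List[str],
    reverse_translate: Callable[[List[str]], List[str]],
    real_pairs: List[Pair],
) -> List[Pair]:
    """Augment `real_pairs` with synthetic pairs built from target-side text.

    `reverse_translate` maps target-language sentences to source-language
    sentences; it stands in for whatever basic model you have trained so far.
    """
    synthetic_sources = reverse_translate(mono_target)
    synthetic = [
        (src, tgt)
        for src, tgt in zip(synthetic_sources, mono_target)
        if src.strip()  # drop sentences the reverse model failed to translate
    ]
    # Human-translated pairs come first, followed by the synthetic ones.
    return list(real_pairs) + synthetic


# Toy usage: an uppercasing function plays the role of the reverse model.
toy_reverse = lambda sentences: [s.upper() for s in sentences]
augmented = back_translate(["bon jornu"], toy_reverse, [("good evening", "bona sira")])
```

The target side of every synthetic pair is genuine human-written text, so the forward model still learns to produce fluent output even when the synthetic source side is noisy.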

Will they start translating? Or will those languages wither away because they’re not in today’s top 100? What’s the significance of your language being included in these models? What does it mean if your language is not included?

raphaelmerx · June 24, 2021, 6:17am

Hi Eryk,

Indeed, there’s a “winner takes it all” phenomenon for languages, and building models that only support the best-resourced languages contributes to that trend. But so does our use of English on this forum!

On a more positive note, the 1000+ models released by Helsinki-NLP using OPUS include some pretty low-resource languages.

Another idea would be to fine-tune BERT to translate between a low-resource language and English, making use of the low-resource language’s similarity to other top-100 languages. For example, the Tetun language (which I work with) is low-resource but has similarities with both Portuguese and Indonesian, both of which are included in BERT. Is that something that you’ve considered?
