BCP-47 or at least ISO 639-3 support in Model Hub tags

jnemecek · May 20, 2022, 2:10pm

Google just released a paper on training a language model for 1000 languages. If they were to put their model on the Hub, at most 183 of those languages would show up, because the Hub only shows ISO 639-1 languages. I personally added a dataset that includes 615 languages, but it doesn’t show up at all in the Hub’s language categories because I used 639-3 tags, which are the ones required as input for the data loading. People working on these languages have to use URL hacks to search for them, including languages like Cebuano, with well over a dozen datasets on the Hub and tens of millions of native speakers. I think NLP has outgrown the set of languages described by ISO 639-1, and if the Hub is to be an inclusive place, it really needs to provide the same ease of access to users of the rest of the world’s languages. I suggest that tags input by users be standardized with BCP-47 and that all of the tags in use be accessible through the tags on the Hub.

I’d happily take part in an effort to update this, but don’t know where in the codebase to find the places where these changes should be made. Anyone more familiar with the codebase able to guide me to the right spots?

dwhitena · May 20, 2022, 3:05pm

Thanks so much for starting this discussion @jnemecek! I would also be very interested in helping with such an effort. I know that SIL (the org that @jnemecek and I are a part of, which helps publish and update ISO 639-3 codes) helps with things like the LDML, with an up-to-date API that specifies language names, alternate names (including names preferred by the community itself), ISO codes, region, script types, etc. http://ldml.api.sil.org/langtags.json

Maybe there are ways we can create sustainable workflows around maintaining tags, updating tags (e.g., when a language code is retired or updated), and providing additional metadata (like alternate names)? I can round up some people to contribute and help kick-start some things on this side.

manning · June 12, 2022, 5:59pm

I would strongly support this as well! While the two-character ISO 639-1 tags have certain advantages of convenience and familiarity, they are very Eurocentric and incomplete. I think the best idea would be to move to BCP 47 (RFC 5646), which maintains the convenience/compatibility of using ISO 639-1 tags where they are defined, while allowing ISO 639-3 etc. tags for all other languages. Also, it supports including the script in a tag, which is important in a number of cases where multiple scripts are used for a language.
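
To make the idea concrete, here is a rough sketch of what that normalization could look like in Python. It is not anything from the Hub codebase: `normalize_tag` and the `ISO639_3_TO_1` table are hypothetical names, and the table is a tiny hand-picked subset. A real implementation would presumably load the IANA Language Subtag Registry instead, and would also need to handle variants, extensions, and retired codes, which this sketch ignores.

```python
import re

# Hypothetical, illustrative subset only. A real implementation would load
# the full mapping from the IANA Language Subtag Registry.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "deu": "de", "srp": "sr", "zho": "zh"}

# language (2-3 letters), optional script (4 letters), optional region
# (2 letters or 3 digits). Accepts "-" or "_" as the separator.
_TAG_RE = re.compile(
    r"^([A-Za-z]{2,3})(?:[-_]([A-Za-z]{4}))?(?:[-_]([A-Za-z]{2}|[0-9]{3}))?$"
)


def normalize_tag(tag: str) -> str:
    """Return a BCP 47 style tag: the 2-letter code where one is defined in
    the table above, the 3-letter code otherwise, with conventional casing."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    lang, script, region = match.groups()
    lang = lang.lower()
    # Prefer the ISO 639-1 code when the language has one.
    parts = [ISO639_3_TO_1.get(lang, lang)]
    if script:
        parts.append(script.title())  # scripts are title-cased, e.g. "Cyrl"
    if region:
        parts.append(region.upper())  # regions are upper-cased, e.g. "TW"
    return "-".join(parts)


print(normalize_tag("eng"))       # language with a 2-letter code
print(normalize_tag("ceb"))       # Cebuano has no 2-letter code
print(normalize_tag("SRP_cyrl"))  # Serbian written in Cyrillic
```

With something like this, a user-supplied `eng` and `en` would end up under the same tag, while a language such as Cebuano keeps its three-letter code `ceb` and still gets a category of its own.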
