BCP-47 or at least ISO 639-3 support in Model Hub tags

Google just released a paper on training a language model for 1,000 languages. If they were to put that model on the Hugging Face Hub, at most 183 of those languages would show up, because the Hub only lists ISO 639-1 languages. I personally added a dataset that includes 615 languages, but it doesn't show up in the Hub's language categories at all, because I used ISO 639-3 tags, which are the ones required as input for the data loading. People working on these languages have to use URL hacks to search for them, and that includes languages like Cebuano, which has well over a dozen datasets on the Hub and tens of millions of native speakers.

I think NLP has outgrown the set of languages described by ISO 639-1. If Hugging Face is to be an inclusive place, it really needs to provide the same ease of access to users of the rest of the world's languages. I suggest that tags input by users be standardized with BCP-47 and that all of the tags in use be accessible through the tags on the Hub.

I’d happily take part in an effort to update this, but I don’t know where in the codebase these changes should be made. Is anyone more familiar with the codebase able to guide me to the right spots?


Thanks so much for starting this discussion, @jnemecek! I would also be very interested in helping with such an effort. SIL (the organization that @jnemecek and I are part of, which helps publish and update ISO 639-3 codes) also helps with things like the LDML, and it provides an up-to-date API that specifies language names, alternate names (including names preferred by the community itself), ISO codes, regions, script types, etc.: http://ldml.api.sil.org/langtags.json

Maybe there are ways we can create sustainable workflows around maintaining tags, updating tags (e.g., when a language code is retired or updated), and providing additional metadata (like alternate names)? I can round up some people to contribute and help kick-start some things on this side.
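
To make the "updating tags" part concrete, here is a minimal sketch of what one maintenance step might look like. The function names and the hand-written table are hypothetical; the three entries are language subtags that the IANA registry marks as deprecated, each with a preferred replacement. A real workflow would presumably generate this table from the registry or from SIL's data rather than hard-coding it.

```python
# Hypothetical, hand-written sample of deprecated primary language subtags
# and their preferred replacements. A real workflow would derive this
# table from an authoritative source instead of hard-coding it.
DEPRECATED_PRIMARY = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def refresh_tag(tag):
    """Replace a deprecated primary language subtag, keeping the rest."""
    primary, sep, rest = tag.partition("-")
    replacement = DEPRECATED_PRIMARY.get(primary.lower())
    if replacement is None:
        return tag  # nothing to update
    return replacement + sep + rest


def refresh_tags(tags):
    """Return a mapping {old_tag: new_tag} for every tag that changed."""
    changes = {}
    for tag in tags:
        new_tag = refresh_tag(tag)
        if new_tag != tag:
            changes[tag] = new_tag
    return changes
```

Run periodically over the tags in use, something like `refresh_tags` would yield a list of proposed renames that a maintainer could review before anything is changed.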

I would strongly support this as well! While the two-character ISO 639-1 tags have certain advantages of convenience and familiarity, they are very Eurocentric and incomplete. I think the best idea would be to move to BCP 47 (RFC 5646), which keeps the convenience and compatibility of ISO 639-1 tags where they are defined, while allowing ISO 639-3 (and other) tags for all other languages. It also supports including the script in a tag, which is important in a number of cases where multiple scripts are used for a language.
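
As a rough illustration of the rule described above, the sketch below turns an ISO 639-3 code into a BCP 47-style tag: it uses the two-letter code where one exists, falls back to the three-letter code otherwise, and optionally appends a script subtag. The function name `to_bcp47` and the tiny mapping table are assumptions made for this example; a real implementation would need the full mapping and would have to handle regions, variants, and the other subtags that BCP 47 defines.

```python
# Illustrative subset only: ISO 639-3 codes that also have an ISO 639-1 code.
ISO639_3_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "srp": "sr",
    "zho": "zh",
}


def to_bcp47(iso639_3, script=None):
    """Build a simple language[-Script] tag from an ISO 639-3 code."""
    code = iso639_3.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter language code: {iso639_3!r}")

    # Prefer the two-letter code when one is defined, else keep the
    # three-letter code unchanged.
    lang = ISO639_3_TO_1.get(code, code)
    if script is None:
        return lang

    script = script.strip()
    if len(script) != 4 or not (script.isascii() and script.isalpha()):
        raise ValueError(f"not a four-letter script code: {script!r}")

    # Script subtags are conventionally written in title case, e.g. "Cyrl".
    return f"{lang}-{script.title()}"
```

With this sketch, `to_bcp47("eng")` gives `en`, `to_bcp47("ceb")` stays `ceb` because Cebuano has no two-letter code, and `to_bcp47("srp", "cyrl")` gives `sr-Cyrl`, which distinguishes Serbian in Cyrillic from `sr-Latn`.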
