Google just released a paper on training a language model for 1000 languages. If they were to put their model on the hub, at most 183 of those languages would show up, because the hub only shows ISO 639-1 languages. I personally added a dataset that includes 615 languages, but it doesn’t show up at all in the hub’s language categories because I used ISO 639-3 tags, which are the ones required as input for the data loading. People working on these languages have to use URL hacks to search for them, and that includes languages like Cebuano, which has well over a dozen datasets on the hub and tens of millions of native speakers. I think NLP has outgrown the set of languages described by ISO 639-1, and if the hub is to be an inclusive place, it really needs to provide the same ease of access to users of the rest of the world’s languages. I suggest that tags input by users be standardized with BCP-47 and that all of the tags in use be accessible through the language tags on the hub.
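To make the BCP-47 suggestion a bit more concrete, here is a minimal sketch of the kind of normalization I have in mind. BCP-47 uses the two-letter ISO 639-1 code for a language when one exists and the three-letter ISO 639-3 code otherwise. The function name `to_bcp47_language` and the tiny `ISO_639_3_TO_1` table are hypothetical and only for illustration; they are not taken from any existing codebase, and a real version would presumably load the full mapping from the IANA Language Subtag Registry.

```python
# Hypothetical, hand-picked subset of ISO 639-3 -> ISO 639-1 mappings.
# Assumption: a real implementation would load the complete table
# (e.g. from the IANA Language Subtag Registry) instead of hard-coding it.
ISO_639_3_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
}


def to_bcp47_language(tag: str) -> str:
    """Normalize a user-supplied language code to a BCP-47 primary language subtag.

    Two-letter codes are kept as they are. Three-letter codes are shortened
    to their two-letter equivalent when one is known, and kept otherwise.
    """
    code = tag.strip().lower()
    if not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a plain language code: {tag!r}")
    if len(code) == 2:
        return code
    if len(code) == 3:
        # Languages without an ISO 639-1 code (such as Cebuano, "ceb")
        # keep their three-letter code.
        return ISO_639_3_TO_1.get(code, code)
    raise ValueError(f"expected a 2- or 3-letter code, got: {tag!r}")
```

With something like this, a dataset tagged `eng` and one tagged `en` would end up under the same language filter, while `ceb` would stay searchable under its own tag instead of being dropped.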
I’d happily take part in an effort to update this, but I don’t know where in the codebase these changes should be made. Is anyone more familiar with the codebase able to guide me to the right spots?