Hi, I’ve noticed that there are two “Turkish” tags available under the language filters. This is a problem because selecting both performs an AND search, which returns almost nothing. Both tags appear to be commonly used: for example, one instance of the Turkish tag is used by Wikimedia’s account while the other is used by HuggingfaceFW. Is there a historical reason for why the two are separate? It’s quite inconvenient to have to search both tags individually for models and datasets.
This is because the implementation allows multiple language code standards.
As for the display in the Hub GUI, it’s always been like that. Or so I thought, but judging by GitHub, it seems to be a display bug that has remained unresolved since 2022… Since they apparently intended to fix it, it’s probably not a design choice. It doesn’t cause any real harm, but I suppose it’s fair to treat it as a bug…
My best read is this:
The current behavior is split into two layers.
At the metadata layer, it looks intended that both tr and tur can exist. At the search/filter UI layer, it looks unintended, or at least like an unfinished normalization problem, that those aliases show up as separate Turkish filters instead of one canonical Turkish bucket. (GitHub)
1. The background
Hugging Face repo cards explicitly allow language metadata to be written using ISO 639-1, ISO 639-2, or ISO 639-3 codes. Their code and docs both say that for models and datasets, language can be a two-letter or three-letter code. That means values like tr, tur, en, eng, fr, and fra are all valid metadata inputs. (GitHub)
That matters because it explains why duplicates can be created in the raw data. If one repo author writes tr and another writes tur, both are accepted by the platform. So duplicate language buckets are not surprising at the storage level. (GitHub)
2. What the current UI is actually doing
The current UI is not just accepting both codes in metadata. It is also surfacing them separately.
On Hugging Face’s /languages page, the same real-world language appears more than once with different codes and different counts. For example:
- English appears as en with 57,167 datasets / 312,494 models and also as eng with 1,896 / 1,184.
- French appears as fra with 10,291 / 1,999 and also as fr with 3,002 / 16,956.
- Turkish appears as tr with 1,487 / 5,867 and also as tur with 165 / 80.
- German appears as de with 2,349 / 14,974 and also as deu with 691 / 937. (Hugging Face)
The model search pages show the same split. Right now, language=tr returns 5,872 models, while language=tur returns 81. Likewise, language=en returns 312,770 models, while language=eng returns 1,186. That means the filter backend is treating these aliases as different search keys, not as one normalized language. (Hugging Face)
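Until the filters are merged, one workaround is to run one query per code and union the results on the client side. Here is a minimal sketch of the difference between AND and OR semantics. The repo IDs are made up, and `search_by_language` is a hypothetical stand-in for a single-code Hub query, not a real API call:

```python
# Hypothetical tag index: repo id -> language codes found in its card metadata.
# The repo names are invented for illustration only.
REPO_LANGUAGES = {
    "org-a/model-1": {"tr"},
    "org-b/model-2": {"tur"},
    "org-c/model-3": {"tr", "tur"},
    "org-d/model-4": {"en"},
}


def search_by_language(code):
    """Stand-in for a single-code query: repos whose tags include this code."""
    return {repo for repo, codes in REPO_LANGUAGES.items() if code in codes}


def search_all(codes):
    """AND semantics: a repo must carry every selected code."""
    results = [search_by_language(c) for c in codes]
    return set.intersection(*results) if results else set()


def search_any(codes):
    """OR semantics: one query per alias, results unioned client-side."""
    found = set()
    for c in codes:
        found |= search_by_language(c)
    return found


both = search_all(["tr", "tur"])    # only repos tagged with both codes
either = search_any(["tr", "tur"])  # every repo tagged with at least one code
print(sorted(both))
print(sorted(either))
```

Using sets for the union also means a repo tagged with both codes is counted once, which a plain sum of the two filter counts would not guarantee.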
There is also a UI-level clue that normalization is weak. On the models filter page, the quick language list shows “English” twice and “French” twice in the same visible filter row. That is exactly what you would expect if multiple codes were being mapped to the same display label without being deduplicated first. (Hugging Face)
3. Why I think the metadata part is intended
This part is the easiest to call.
Hugging Face’s own repo-card code and docs do not restrict users to one canonical code family. They explicitly allow 639-1, 639-2, and 639-3. So a repository tagged with tr is valid, and a repository tagged with tur is also valid. That is not a bug by itself. (GitHub)
There is also a long-standing community push for broader language-code support, not narrower support. In the forum discussion about BCP-47 or at least ISO 639-3 support, users argue that two-letter codes are incomplete and that the Hub should support broader language identifiers. That aligns with Hugging Face allowing multiple code standards in metadata. (Hugging Face Forums)
So if the question is, “Should Hugging Face permit repos to carry tr or tur?” then the answer is yes, that appears intentional. (GitHub)
4. Why I think the UI behavior is probably not intended
This part is an inference, but a strong one.
The strongest evidence is issue hub-docs#193. In that issue, the discussion says a useful improvement would be to transform ISO 639-2 or 639-3 tags into ISO 639-1, and it gives fra versus fr as the concrete example. The stated reason is discoverability: datasets tagged fra should be findable as French. That is the opposite of the current UI behavior, where fr and fra are still separate filter buckets. (GitHub)
Hugging Face’s own Huggy Lingo blog post points the same way. It says that when their metadata-enrichment pipeline predicts a language in ISO 639-3, they convert it to ISO 639-1 where possible, and it explicitly says this is because ISO 639-1 codes have better support in the Hub UI for navigating datasets. That tells me the product thinking was not “keep all equivalent aliases separate in the UI.” It was closer to “accept broad inputs, but steer toward a canonical UI representation.” (GitHub)
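The "convert to ISO 639-1 where possible" idea can be sketched as a lookup with a fallback. This is only an illustration, not Hugging Face's pipeline: `TO_ISO639_1` and `canonical_language` are hypothetical names, and the table covers just the languages mentioned in this thread.

```python
# Small hand-written alias table (three-letter code -> ISO 639-1).
# Assumption: a real implementation would use a full language-code database.
TO_ISO639_1 = {
    "tur": "tr",
    "eng": "en",
    "fra": "fr",
    "deu": "de",
}


def canonical_language(code):
    """Return the two-letter code when one is known, else the cleaned input."""
    normalized = code.strip().lower()
    return TO_ISO639_1.get(normalized, normalized)


print(canonical_language("tur"))  # tr
print(canonical_language("tr"))   # tr
print(canonical_language("yue"))  # yue (not in the table, returned unchanged)
```

Falling back to the original code keeps languages without a two-letter equivalent searchable instead of dropping them.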
Related GitHub issues also frame the broader language-filter situation as a problem, not as a settled design choice. One issue says datasets tagged by ISO language code were not accessible through the language search form. Another says some ISO 639-3 codes were present in the list but impossible to enter in the Hub. Those are exactly the kinds of bugs you see when storage, input widgets, and search normalization are not fully aligned. (GitHub)
The dataset issue about the language-code database goes even wider. It calls for connecting to a bigger language-code database and notes that the current list is partial and hard to maintain. That again sounds like an incomplete language-metadata system, not a deliberate choice to present alias duplicates as separate first-class languages forever. (GitHub)
5. What I think is happening technically
The simplest model is:
- Hugging Face accepts multiple code standards in repo metadata.
- The Hub stores and indexes those values largely as provided.
- The UI converts codes into human-readable names like “Turkish” or “English”.
- But the UI and search system do not fully canonicalize aliases before counting, filtering, or displaying them. (GitHub)
That would explain all of the observed behavior at once:
- why tr and tur are both allowed,
- why they get separate counts,
- why both show up as “Turkish,”
- why the same duplication shows up for English, French, German, and others,
- and why old GitHub issues talk about normalization and search discoverability. (GitHub)
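That model can be simulated in a few lines. The sketch below is an assumption about the behavior, not the Hub's actual code: the repos, the `LABELS` and `CANONICAL` tables, and the `filter_rows` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical repos with language values stored exactly as authors wrote them.
REPOS = [
    ("repo-1", "tr"),
    ("repo-2", "tr"),
    ("repo-3", "tur"),
    ("repo-4", "en"),
    ("repo-5", "eng"),
]

# Display names keyed by raw code, so aliases end up sharing a label.
LABELS = {"tr": "Turkish", "tur": "Turkish", "en": "English", "eng": "English"}

# Assumed alias table; not the Hub's actual mapping.
CANONICAL = {"tur": "tr", "eng": "en"}


def filter_rows(repos, canonicalize):
    """Build (label, code, count) filter rows, optionally merging aliases first."""
    counts = Counter(
        CANONICAL.get(code, code) if canonicalize else code for _, code in repos
    )
    return sorted((LABELS[code], code, n) for code, n in counts.items())


raw_rows = filter_rows(REPOS, canonicalize=False)
merged_rows = filter_rows(REPOS, canonicalize=True)

for row in raw_rows:
    print("raw:   ", row)
for row in merged_rows:
    print("merged:", row)
```

Without canonicalization the label "Turkish" appears on two rows with separate counts, which matches what the filter pages show. With canonicalization applied before counting, each language gets one row.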
6. My actual conclusion
Here is the plain version:
- Intended: Hugging Face allowing both tr and tur in metadata. (GitHub)
- Probably not intended as the final UX: the search UI treating those aliases as separate Turkish filters and separate count buckets. (GitHub)
So I would describe it as:
Not a metadata bug. Likely a search/filter normalization bug or product gap.
7. Confidence level
I have high confidence in the first part: accepting both code families is by design. (GitHub)
I have medium-high confidence in the second part: the current duplicate-filter UI is probably not intended behavior, because Hugging Face’s own issues and blog material point toward canonicalization for discoverability, not toward keeping alias codes as separate user-facing language buckets. I cannot prove that with a maintainer quote saying “this is a bug,” but the direction of the evidence is pretty clear. (GitHub)