The results are strong so far, and the issues you're running into are within the expected range.
I’ll go over each of your three issues slowly, with context and concrete “do this” recommendations.
You’re already doing very well:
- Flat TF-IDF + LinearSVC: 81.7% Top-1, 92.7% Top-3 on 500+ BC classes in a long-tailed, technical domain. That's strong.
The problems you’re seeing with SetFit and HiClass are exactly what you’d expect in:
- a long-tailed label distribution (many rare BCs), and
- a partially missing hierarchy (MC/GC often unknown).
Long-tailed and imbalanced setups are known to break standard training if you don’t adapt them. (arXiv)
I’ll answer your questions in order and keep the language as simple as possible.
1. SetFit keeps collapsing into one dominant class
1.1 Why this is happening (intuitive explanation)
SetFit has two stages:
- Contrastive fine-tuning of an encoder:
  - It forms positive pairs: two examples with the same label.
  - It forms negative pairs: two examples with different labels.
  - It then moves embeddings of positives closer and negatives further apart.
- A simple classifier head on top of these embeddings (e.g. logistic regression). (GitHub)
In a long-tailed BC distribution:
- Head BCs (common labels) have many samples → many positive pairs.
- Tail BCs (rare labels) have 1–3 samples → almost no positive pairs, sometimes none.
The contrastive loss therefore:
- Learns a good cluster for head classes.
- Has almost no signal to carve out tail classes.
The classifier head then learns "if I must guess, predicting the big head class gives me good accuracy," and it collapses into predicting that class most of the time.
This behaviour is exactly what people report when using SetFit on many imbalanced classes (see GitHub discussion where users describe poor results on 90+ very imbalanced classes). (GitHub)
So nothing is “wrong” with your code. It’s mostly a mismatch between SetFit’s assumptions and your data shape.
1.2 How to detect this collapse systematically
You can build the detection directly into your pipeline.
A. Before training (data check)
Compute:
- For each BC i, its count n_i.
- Summary stats: min, max, median, and the fraction of labels with n_i < 3.
If a big fraction of BCs have <3 examples, you are firmly in long-tailed territory; standard methods will struggle without special handling. (arXiv)
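A minimal sketch of this data check, assuming bc_labels is a plain list of BC strings, one per FC (the variable name is just for illustration):

```python
from collections import Counter

# bc_labels: one BC string per FC example (illustrative name)
counts = Counter(bc_labels)
sizes = sorted(counts.values())

print("number of BCs:", len(counts))
print("min / median / max examples per BC:",
      sizes[0], sizes[len(sizes) // 2], sizes[-1])

rare = sum(1 for c in sizes if c < 3)
print("fraction of BCs with n_i < 3:", rare / len(counts))
```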
B. After training (prediction check)
On a validation set:
- Predicted label distribution
  - Count how many predictions fall into each BC.
  - Compute: dominant_fraction = (max count of any predicted BC) / (total predictions)
  - If dominant_fraction is very high (e.g. > 0.5–0.7), the model is effectively predicting one label for most examples.
- Per-class recall
  - Use something like classification_report (sklearn).
  - Look at how many BCs have recall = 0.0.
  - If many labels never get predicted at all, your model is ignoring a large part of the taxonomy.
- Entropy of predictions (optional)
  - Compute the entropy of the empirical predicted label distribution.
  - If it is much lower than the entropy of the true label distribution, that's another sign of over-concentration on a few classes.
These checks give you an automatic “collapse detector” in your training loop. When you see collapse, you don’t promote that model.
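A sketch of such a collapse detector, assuming y_true and y_pred are lists of BC strings from your validation set:

```python
import numpy as np
from collections import Counter
from sklearn.metrics import classification_report

def collapse_report(y_true, y_pred):
    """Majority-prediction fraction, zero-recall label count, and entropy gap."""
    dominant_fraction = max(Counter(y_pred).values()) / len(y_pred)

    per_class = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    zero_recall = [lbl for lbl, m in per_class.items()
                   if isinstance(m, dict) and lbl not in ("macro avg", "weighted avg")
                   and m.get("recall") == 0.0]

    def entropy(labels):
        p = np.array(list(Counter(labels).values()), dtype=float)
        p /= p.sum()
        return float(-(p * np.log(p)).sum())

    return {
        "dominant_fraction": dominant_fraction,   # e.g. flag the run if > 0.5
        "n_zero_recall_labels": len(zero_recall),
        "pred_entropy": entropy(y_pred),
        "true_entropy": entropy(y_true),          # pred_entropy << true_entropy = collapse
    }
```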
2. Long-tailed BCs + SetFit + label mismatch (631 vs 562)
There are two different issues here:
- Long-tailed distribution → hard for SetFit / contrastive training.
- Label mapping mismatch → 631 BCs in the SetFit classifier, but 562 in y_true.
Let’s fix the structural problem first (label mapping), then talk about how to use SetFit (or not) in a long-tailed scenario.
2.1 Fixing the label-index mismatch
A mismatch like "631 BCs in model, 562 in y_true" usually means the label space was built more than once from different data: for example, the classifier head was sized from all BCs in the raw data, while y_true only contains the BCs that survived filtering or that appear in the evaluation split, so the two mappings disagree in both size and index order.
What to do:
- Define one canonical mapping from BC string → integer ID:
  unique_bcs = sorted(set(all_bc_labels_you_want_to_model))
  label2id = {bc: i for i, bc in enumerate(unique_bcs)}
  id2label = {i: bc for bc, i in label2id.items()}
- Apply this same mapping consistently to:
  - the TF-IDF baseline (if it uses integer labels),
  - SetFit training datasets,
  - any embedding exporter,
  - any RAG or k-NN classification logic.
- Filter out BCs you don't want to model before building the mapping:
  - If you decide "we do not model BCs with <2 examples," remove those rows entirely before computing unique_bcs.
  - Then SetFit, TF-IDF, etc. all see exactly the same 562 labels.
Once this is done, there should be no more 631 vs 562 issue: your classifier head size will match your y_true label space.
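A sketch of that filter-then-map order, assuming texts and bcs are parallel lists of FC texts and BC labels (illustrative names):

```python
from collections import Counter

MIN_COUNT = 2  # "do not model BCs with <2 examples" from the text above

counts = Counter(bcs)
pairs = [(t, bc) for t, bc in zip(texts, bcs) if counts[bc] >= MIN_COUNT]
texts_kept = [p[0] for p in pairs]
bcs_kept = [p[1] for p in pairs]

# The single canonical mapping, built only from labels that survived filtering
unique_bcs = sorted(set(bcs_kept))
label2id = {bc: i for i, bc in enumerate(unique_bcs)}
id2label = {i: bc for bc, i in label2id.items()}

y = [label2id[bc] for bc in bcs_kept]  # reuse for TF-IDF, SetFit, k-NN, RAG ...
```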
2.2 What to do with BCs that have <3 examples
This is the long-tail issue. Modern long-tailed learning papers basically say:
- You can’t expect normal supervised learning to separate hundreds of classes when many have 1–3 samples. You must adjust training or adjust the label space. (arXiv)
Here are practical strategies tailored to your use case.
Strategy A: Collapse rare BCs into “OTHER” + use retrieval for the tail
- Pick a threshold T (e.g. 3 or 4):
  - head_BCs = {BC | count(BC) >= T}
  - tail_BCs = the rest.
- In training labels:
  - Keep each head BC as itself.
  - Map all tail BCs to a special label "BC_OTHER".
- Train SetFit (or any classifier) over this reduced label space.
- At inference:
  - If the classifier predicts a specific head BC, use it.
  - If it predicts "BC_OTHER", fall back to retrieval for the tail: embed the FC, find its nearest neighbours among the tail-BC training examples, and take the best-matching tail BC (see the sketch below).
This hybrid “head = classifier, tail = retrieval” pattern is widely used in long-tailed setups. (SpringerLink)
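A minimal sketch of the head/tail split and the retrieval fallback, reusing texts_kept / bcs_kept from above; embed() stands in for whatever encoder you use, and clf is any text classifier trained on train_labels (all of these names are placeholders):

```python
from collections import Counter
import numpy as np
from sklearn.neighbors import NearestNeighbors

T = 3  # threshold from Strategy A
counts = Counter(bcs_kept)
head_bcs = {bc for bc, c in counts.items() if c >= T}

# Classifier labels: head BCs stay as-is, tail BCs collapse into one bucket
train_labels = [bc if bc in head_bcs else "BC_OTHER" for bc in bcs_kept]

# Retrieval index over the tail examples only
tail_idx = [i for i, bc in enumerate(bcs_kept) if bc not in head_bcs]
tail_vecs = np.vstack([embed(texts_kept[i]) for i in tail_idx])
nn = NearestNeighbors(n_neighbors=1).fit(tail_vecs)

def predict_bc(text, clf):
    pred = clf.predict([text])[0]            # head classifier (trained on train_labels)
    if pred != "BC_OTHER":
        return pred
    _, j = nn.kneighbors(embed(text).reshape(1, -1))
    return bcs_kept[tail_idx[int(j[0, 0])]]  # BC of the nearest tail example
```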
Strategy B: Don’t fine-tune the encoder; use frozen embeddings + simple classifiers
Because your TF-IDF baseline is very strong already, you can do something simpler than SetFit:
- Use a pre-trained encoder like E5-base or BGE-base. (GitHub)
- Do not fine-tune it contrastively on your long-tailed BC dataset.
- Instead, compute embeddings once and either:
  - Train a logistic regression / linear SVM head on the embeddings, with class weights or a cost-sensitive loss if needed. (labelyourdata.com)
  - Or use a nearest-centroid classifier:
    - For each BC: centroid = average embedding of its FCs.
    - For a new FC: encode it, then choose the nearest centroid.
    - This is simple and robust, and it does not require balanced pairs.
Because the encoder is fixed, it is not dominated by the head BC during training. You’re reusing the general semantic space of the pretrained model and just mapping it to your BCs.
This is often more stable than trying to fine-tune with heavily imbalanced SetFit pairs.
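A sketch of the frozen-encoder + nearest-centroid idea, using sentence-transformers (the model name is just one plausible choice; an E5 model works the same way):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")   # assumed model choice

# Encode once; the encoder itself is never fine-tuned
X = encoder.encode(texts_kept, normalize_embeddings=True)

# One centroid per BC = mean embedding of its FCs
centroid_bcs = sorted(set(bcs_kept))
centroids = np.vstack([
    X[[i for i, b in enumerate(bcs_kept) if b == bc]].mean(axis=0)
    for bc in centroid_bcs
])

def predict_bc_centroid(text):
    v = encoder.encode([text], normalize_embeddings=True)[0]
    return centroid_bcs[int(np.argmax(centroids @ v))]  # nearest centroid by dot product
```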
Strategy C: Use SetFit only on a smaller, more balanced sub-problem
If you still want to use SetFit in some way:
- Restrict it to a subset of BCs where you have enough examples (e.g. top 50–100 BCs).
- Use SetFit to build a very good classifier for these head classes (and maybe “OTHER”).
- Use different mechanisms (TF-IDF, k-NN, RAG) for the tails.
In all cases, keep that single canonical BC mapping so that label indices match across components.
3. HiClass drops to ~45% because MC/GC are often missing
You noticed:
MC is 52% complete and GC is 58% complete. UNKNOWN_MC and UNKNOWN_GC dominate and break the tree.
This is a classic failure mode of local hierarchical classifiers when many internal levels are “unknown”.
3.1 How HiClass works and why UNKNOWN dominates
HiClass implements several local hierarchical strategies (Local Classifier Per Node, Per Parent Node, Per Level). (arXiv)
In the Local Classifier Per Parent Node design (the one you’re likely using):
- For each parent node in the tree, HiClass trains a separate multi-class classifier to decide which child to go to. (hiclass.readthedocs.io)
- E.g. at the BC level, a classifier picks BC1 vs BC2 vs ....
- At the MC level, a classifier picks among MC children (including any “UNKNOWN_MC”).
Now imagine at some parent node:
- 60% of the training samples go to "UNKNOWN_MC".
- The rest are split across many “real” MCs.
The local classifier will naturally learn:
Most of the time, choose "UNKNOWN_MC"; that gives high accuracy at that node.
Once this happens near the top of the path, almost all samples are routed into "UNKNOWN_MC" and never reach the deeper, more specific children. The lower levels become effectively useless for predictions, and your overall tree performance collapses.
This is the behaviour you’re seeing.
3.2 How to treat missing MC/GC (and what you actually need)
Your real business need is:
- Given FC short + long name, predict BC.
You do not actually need MC and GC to be predicted. They happen to exist in the data, but they are:
- incomplete (about half missing), and
- unstable for a local hierarchical training design.
Given that:
Best practical choice right now:
Stop trying to model MC and GC in the hierarchy.
Use a flat BC classifier (the one you already have) as your main model.
You can still use MC/GC as text features:
- For training examples where MC/GC exist, include them in the input text string that you TF-IDF / embed (leave the BC itself out, since that is what you are predicting); see the sketch below:
  "AC: AppLayer | MC: EngSync | GC: FuncMon | FC: EngSync_TaskHandler | DESC: Task Activation Handler"
- For missing MC/GC, either omit them or write "MC: None" just as text. This does not create a tree node; it is just additional context for the model when available.
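A small sketch of building that input string, assuming each record is a dict with keys like "ac", "mc", "gc", "fc_short", "fc_long" (field names are illustrative):

```python
def build_input_text(row: dict) -> str:
    # Missing MC/GC are simply skipped; no UNKNOWN_* placeholder is created
    parts = [f"AC: {row['ac']}"]
    if row.get("mc"):
        parts.append(f"MC: {row['mc']}")
    if row.get("gc"):
        parts.append(f"GC: {row['gc']}")
    parts.append(f"FC: {row['fc_short']}")
    parts.append(f"DESC: {row['fc_long']}")
    return " | ".join(parts)

# build_input_text({"ac": "AppLayer", "mc": "EngSync", "gc": "FuncMon",
#                   "fc_short": "EngSync_TaskHandler", "fc_long": "Task Activation Handler"})
# -> "AC: AppLayer | MC: EngSync | GC: FuncMon | FC: EngSync_TaskHandler | DESC: Task Activation Handler"
```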
If you still want some hierarchy:
- Keep only AC → BC in the label path.
  - AC is probably much more complete than MC/GC.
  - Build labels like ["AppLayer", "AppSup"].
  - Train HiClass on this two-level hierarchy only (a sketch follows below).
  - This is much more stable and closer to your actual use case. (arXiv)
- Do not create explicit "UNKNOWN_MC" or "UNKNOWN_GC" nodes.
  - If a path ends at BC, simply represent it as [AC, BC] without deeper nodes. HiClass can handle paths of different lengths; it pads them internally, so you don't need to turn missing levels into children. (arXiv)
- Decide based on numbers:
  - If AC → BC with HiClass doesn't beat your flat BC model by a clear margin, stick with the simpler flat model.
This is also consistent with research: when hierarchies are noisy or incomplete, flat classifiers often outperform hierarchical ones. Hierarchical methods help most when the structure is clean and reliable. (Science Direct)
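If you do try the two-level AC → BC hierarchy, a hedged sketch with HiClass's LocalClassifierPerParentNode could look like this (acs, bcs_kept, texts_kept are the illustrative names from above; the local classifier and TF-IDF settings are arbitrary examples):

```python
from hiclass import LocalClassifierPerParentNode
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Every label is a fixed two-level path [AC, BC]; no UNKNOWN_* nodes anywhere
y_paths = [[ac, bc] for ac, bc in zip(acs, bcs_kept)]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ("hier", LocalClassifierPerParentNode(local_classifier=LogisticRegression(max_iter=1000))),
])
pipe.fit(texts_kept, y_paths)

pred_paths = pipe.predict(texts_kept)    # each row is a full [AC, BC] path
pred_bcs = [p[-1] for p in pred_paths]   # compare these against the flat BC model
```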
4. Overall recommendation for your project
Given your numbers and constraints, here is a simple, robust plan:
4.1 Main BC classifier
- Keep your TF-IDF (char 3–5 + word 1–2) + LinearSVC as the main BC model. It’s already performing very well.
- Use your existing calibration (prefit) to turn scores into approximate probabilities; it's OK that calibration is imperfect for rare BCs, and you can still use it for thresholding (see the sketch below). (MachineLearningMastery.com)
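A hedged sketch of the prefit calibration + thresholding step, assuming svc is your already-fitted LinearSVC and X_cal / y_cal is a held-out calibration split (newer sklearn versions steer you towards FrozenEstimator instead of cv="prefit", so check your version):

```python
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(svc, method="sigmoid", cv="prefit")
calibrated.fit(X_cal, y_cal)               # only fits the calibration mapping

proba = calibrated.predict_proba(X_new)    # approximate probabilities per BC
top1 = proba.argmax(axis=1)
confident = proba.max(axis=1) >= 0.6       # illustrative threshold for auto-accept vs. review
```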
4.2 Semantic backup / RAG
Build an embedding index over your labelled FCs (short name, long name, description) and retrieve the nearest neighbours for each new FC. Then:
- Use the neighbours and their BCs as a fallback for tail BCs that the main classifier handles poorly.
- Show the retrieved examples to reviewers as evidence ("these similar FCs belong to BC X"), which makes predictions easier to explain.
4.3 SetFit usage
For SetFit runs:
- Always use one canonical BC mapping so label indices match.
- Balance the training dataset using SetFit's sampling utilities (same number of examples per label where possible; see the sketch below). (GitHub)
- Log prediction collapse metrics (majority prediction fraction, # of zero-recall labels) as part of your evaluation.
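A sketch of a balanced SetFit run, assuming the SetFit v1.x API and a datasets.Dataset with "text" and "label" columns (model name and hyperparameters are just examples):

```python
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

# Down-sample to a fixed number of examples per label where possible
balanced_ds = sample_dataset(train_ds, label_column="label", num_samples=8)

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=balanced_ds,
    eval_dataset=eval_ds,
)
trainer.train()

# Afterwards, run the collapse checks from section 1.2 on the model's validation predictions
```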
4.4 HiClass usage
- If you use HiClass at all, restrict it to the AC → BC hierarchy and drop the UNKNOWN_MC / UNKNOWN_GC nodes, as described in section 3.2.
- Compare it against the flat BC model on the same validation split, and keep whichever wins by a clear margin.
5. A short mental checklist you can reuse
When you see “model collapses to one class”:
- Check label counts → are there many classes with 1–3 samples?
- Check predictions → what fraction of examples get the majority predicted BC?
- Check per-class recall → how many labels never get predicted?
- If yes to all of the above, reconsider the training method for such long-tailed data (e.g., collapse tails, use retrieval, keep the encoder frozen). (CVF Open Access)
You are already on a very good path. The main work now is not “find a magical new model,” but stabilize the system around your strong baseline, handle the long tail pragmatically, and use RAG/nearest-neighbour for explanation and tail BC support.