How to train a text-based classifier to predict parent module (BC) from functional component names and simple descriptions?

Hi everyone,

I am working on a classification problem in the automotive domain.
The goal is to predict which Base Component (BC) a new Functional Component (FC) belongs to.

Each FC has:

  • a short name (abbreviated identifier)

  • a long name (short English description)

  • known hierarchy fields like AC, MC, GC, etc. (some may be empty)

The task:

Take the FC short name + long name (description) as input and predict the BC it belongs to.

My training data has a hierarchical structure: AC → BC → MC → GC → FC, but for prediction I only have access to the FC name and description.

Additional Context:

  • I have a glossary mapping abbreviations to full terms (e.g., “Accr” → “Accelerator”, “MoF” → “Function Monitoring”)

  • Technical domain: Automotive software architecture

  • The naming follows specific conventions with underscores, prefixes, and technical abbreviations

Questions:

  1. What’s the best approach for this hierarchical classification with mixed naming patterns?

  2. Should I use transformer models or traditional ML with feature engineering?

  3. Any experience with similar technical domain classification problems?

Dataset Structure

AC → Top-level application layer
└── BC → Main functional subsystem [TARGET LABEL]
    └── MC → Optional submodule layer
        └── GC → Optional function grouping
            └── FC (Functional Component) → Specific function [INPUT]

An example of what my data looks like (this is not my real data):

    AC        BC      MC        GC        FC                     FC_Description
    AppLayer  -       -         -         -                      Application Layer
    AppLayer  AppSup  -         -         -                      Application Supervisor
    AppLayer  AppSup  EngSync   -         -                      Engine Synchronization Controller
    AppLayer  AppSup  EngSync   -         EngSync_Adapter        Software Adapter Component
    AppLayer  AppSup  EngSync   -         EngSync_CamEvtCfg      Camshaft Event Configuration Module
    AppLayer  AppSup  EngSync   -         EngSync_Monitor        Engine Sync Controller Monitoring
    AppLayer  AppSup  EngSync   -         EngSync_TaskHandler    Task Activation Handler
    AppLayer  AppSup  HiSvc     -         -                      High-Level Service Library
    AppLayer  AppSup  HiSvc     -         HiSvc_Library          High-Level Service Library
    AppLayer  AppSup  HiSvc     SecComm   SecComm_Adapter        Secure Communication Adapter
    AppLayer  AppSup  PwrMod    -         -                      Power Mode Manager
    AppLayer  AppSup  PwrMod    PwrCoord  -                      Power Mode Coordinator Functions
    AppLayer  AppSup  PwrMod    PwrCoord  PwrCoord_Periph        Power Mode Coordinator Peripherals
    AppLayer  AppSup  SafeFunc  FuncMon   -                      Function Monitoring
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonAccel_BrkCheck  Accelerator Brake Plausibility Check
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonAir_AddFunc     Additional Air Monitoring Function
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonAir_Charge      Relative Air Charge Monitoring
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonBrake           Brake System Monitoring
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonBrake_Hardware  Brake Hardware Input Monitoring
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonBrake_System    Brake System Safety Monitor
    AppLayer  AppSup  SafeFunc  FuncMon   FuncMonBase_Speed      Speed Signal Monitoring

  1. The MC and GC columns can often be empty.
  2. Some FCs start with their BC name, others start with GC or MC identifiers, and a few start with completely unrelated prefixes.
  3. The FC long names describe the component's functionality.

The FC names follow different patterns:

  1. Direct BC naming: starts directly with BC name

  2. MC-based naming: starts directly with MC name

  3. GC-based naming: starts directly with GC name

  4. Mixed patterns: the FC name starts with a completely different identifier

Available Resources:

  • The dataset includes the following columns: AC, BC, MC, GC, FC, and FC task — where FC is the structured short identifier, FC task is its descriptive English text, and BC is the target label to be predicted.

  • A complete glossary mapping abbreviations to full terms.

I’m currently unsure what preprocessing or embedding strategy would work best to identify the correct BC for a new FC. Could anyone please guide me on the right process or steps to follow for this type of structured + text-based classification problem?

After this part, my mentor also wants me to build a RAG (Retrieval-Augmented Generation) pipeline on top of it, so if you could also suggest a pipeline or architecture that would fit this kind of dataset (even a simple one), that would be amazing. I am a bit lost, and any pointers or example workflows would really help me move forward.


Hmm, not really familiar with this topic. Sounds like a case where you’d use SetFit…?


Here is the update:

Original TF-IDF baseline (FC Long Name only):
Test = 57.81%, Val = 58.28%

Normalized TF-IDF baseline
(FC_RAW + split tokens + glossary expansions + description):
Test = 72.99%

Strong baseline (char 3–5 + word 1–2 TF-IDF, class-weighted LinearSVC + calibrated probabilities):
Test Top-1 = 81.74%, Top-3 = 92.66%

Calibration used the prefit fallback because CV=3 didn’t have enough samples for rare BCs.

I am sorry for asking so many naive questions, but I have a few doubts and need your help with these:

  1. My SetFit model repeatedly collapses into a single dominant class. What does your recommended pipeline use to detect label imbalance or class-frequency collapse during encoder fine-tuning?
  2. My BC distribution is long-tailed; many labels have fewer than 3 examples, which seems to destabilize contrastive pairing. Which strategy should I use for SetFit fine-tuning? Also, the SetFit embeddings were exported with 631 BC classes, but y_true has 562 BC classes, causing a label-index mismatch and probability misalignment. What do I do?
  3. My HiClass model collapses to around 45% accuracy because the MC column is only 52% complete and GC only 58% complete. “UNKNOWN_MC” and “UNKNOWN_GC” nodes dominate the internal classifiers and break the tree. MC and GC do not always exist. How should I treat missing MC/GC? Or is it better to stop modeling the hierarchy altogether?

Hm. The results are exemplary so far, and the issues that have arisen appear to be within the expected range.


I’ll go over each of your three issues slowly, with context and concrete “do this” recommendations.

You’re already doing very well:
– Flat TF-IDF + LinearSVC: 81.7% Top-1, 92.7% Top-3 on 500+ BC classes in a long-tailed, technical domain. That’s strong.

The problems you’re seeing with SetFit and HiClass are exactly what you’d expect in:

  • a long-tailed label distribution (many rare BCs), and
  • a partially missing hierarchy (MC/GC often unknown).

Long-tailed and imbalanced setups are known to break standard training if you don’t adapt them. (arXiv)

I’ll answer your questions in order and keep the language as simple as possible.


1. SetFit keeps collapsing into one dominant class

1.1 Why this is happening (intuitive explanation)

SetFit has two stages:

  1. Contrastive fine-tuning of an encoder

    • It forms positive pairs: two examples with the same label.
    • It forms negative pairs: two examples with different labels.
    • It then moves embeddings of positives closer, negatives further apart.
  2. A simple classifier head on top of these embeddings (like logistic regression). (GitHub)

In a long-tailed BC distribution:

  • Head BCs (common labels) have many samples → many positive pairs.
  • Tail BCs (rare labels) have 1–3 samples → almost no positive pairs, sometimes none.

The contrastive loss therefore:

  • Learns a good cluster for head classes.
  • Has almost no signal to carve out tail classes.
  • The classifier head then learns “If I must guess, predicting the big head class gives me good accuracy,” and it collapses into predicting that class most of the time.

This behaviour is exactly what people report when using SetFit on many imbalanced classes (see GitHub discussion where users describe poor results on 90+ very imbalanced classes). (GitHub)

So nothing is “wrong” with your code. It’s mostly a mismatch between SetFit’s assumptions and your data shape.

1.2 How to detect this collapse systematically

You can build the detection directly into your pipeline.

A. Before training (data check)

Compute:

  • For each BC i, its count n_i.
  • Summary: min, max, median, and the fraction of labels with n_i < 3.

If a big fraction of BCs have <3 examples, you are firmly in long-tailed territory; standard methods will struggle without special handling. (arXiv)
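As a concrete example, a minimal version of this data check could look like the sketch below (assuming your BC labels are in a pandas Series called bc_labels; the name is just illustrative):

    import pandas as pd

    # bc_labels: one BC label per training FC (illustrative variable name)
    counts = bc_labels.value_counts()

    print("number of BC classes:", len(counts))
    print("min / median / max examples per BC:", counts.min(), counts.median(), counts.max())

    # Fraction of BC labels with fewer than 3 training examples
    rare_fraction = (counts < 3).mean()
    print(f"fraction of BCs with < 3 examples: {rare_fraction:.2%}")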

B. After training (prediction check)

On a validation set:

  1. Predicted label distribution

    • Count how many predictions fall into each BC.

    • Compute:

      • dominant_fraction = (max count of any predicted BC) / (total predictions)
    • If dominant_fraction is very high (e.g. > 0.5–0.7), the model is effectively predicting one label for most examples.

  2. Per-class recall

    • Use something like classification_report (sklearn).
    • Look at how many BCs have Recall = 0.0.
    • If many labels never get predicted at all, your model is ignoring a large part of the taxonomy.
  3. Entropy of predictions (optional)

    • Compute the entropy of the empirical predicted label distribution.
    • If it is much lower than the entropy of the true label distribution, that’s another sign of over-concentration on a few classes.

These checks give you an automatic “collapse detector” in your training loop. When you see collapse, you don’t promote that model.
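A sketch of these prediction-side checks with plain scikit-learn / SciPy (y_true and y_pred are assumed to be the validation labels and predictions as BC strings):

    import numpy as np
    from collections import Counter
    from scipy.stats import entropy
    from sklearn.metrics import recall_score

    def collapse_report(y_true, y_pred):
        pred_counts = Counter(y_pred)
        dominant_fraction = max(pred_counts.values()) / len(y_pred)

        # Per-class recall; labels the model never gets right end up with recall 0
        labels = sorted(set(y_true))
        recalls = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
        zero_recall_labels = int((recalls == 0).sum())

        # Entropy of predicted vs. true label distribution
        def dist(values):
            c = Counter(values)
            return np.array([c[l] for l in labels], dtype=float) / len(values)

        return {
            "dominant_fraction": dominant_fraction,
            "zero_recall_labels": zero_recall_labels,
            "pred_entropy": float(entropy(dist(y_pred))),
            "true_entropy": float(entropy(dist(y_true))),
        }

If dominant_fraction climbs above roughly 0.5, or a large share of labels has zero recall, flag the run as collapsed and don’t promote that model.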


2. Long-tailed BCs + SetFit + label mismatch (631 vs 562)

There are two different issues here:

  1. Long-tailed distribution → hard for SetFit / contrastive training.
  2. Label mapping mismatch → 631 BCs in the SetFit classifier, but 562 in y_true.

Let’s fix the structural problem first (label mapping), then talk about how to use SetFit (or not) in a long-tailed scenario.

2.1 Fixing the label-index mismatch

A mismatch like “631 BCs in model, 562 in y_true” usually means:

  • You built different label vocabularies in different parts of the pipeline:

    • For example, SetFit saw some additional BCs that you later filtered out for evaluation, or you built the label encoder independently in multiple scripts.

What to do:

  1. Define one canonical mapping from BC string → integer ID.

    unique_bcs = sorted(set(all_bc_labels_you_want_to_model))
    label2id = {bc: i for i, bc in enumerate(unique_bcs)}
    id2label = {i: bc for bc, i in label2id.items()}
    
  2. Apply this same mapping consistently for:

    • TF-IDF baseline (if it uses integer labels),
    • SetFit training datasets,
    • any embedding exporter,
    • any RAG or k-NN classification logic.
  3. Filter out BCs you don’t want before building the mapping:

    • If you decide “we do not model BCs with <2 examples,” remove those rows entirely before computing unique_bcs.
    • Then SetFit, TF-IDF, etc. all see exactly the same 562 labels.

Once this is done, there should be no more 631 vs 562 issue: your classifier head size will match your y_true label space.

2.2 What to do with BCs that have <3 examples

This is the long-tail issue. Modern long-tailed learning papers basically say:

  • You can’t expect normal supervised learning to separate hundreds of classes when many have 1–3 samples. You must adjust training or adjust the label space. (arXiv)

Here are practical strategies tailored to your use case.

Strategy A: Collapse rare BCs into “OTHER” + use retrieval for the tail

  1. Pick a threshold T (e.g. 3 or 4):

    • head_BCs = {BC | count(BC) >= T}
    • tail_BCs = the rest.
  2. In training labels:

    • Keep each head BC as itself.
    • Map all tail BCs to a special label "BC_OTHER".
  3. Train SetFit (or any classifier) over this reduced label space:

    • head_BCs + "BC_OTHER".
  4. At inference:

    • If the classifier predicts a specific head BC, use it.

    • If it predicts "BC_OTHER", then:

      • Use a nearest-neighbor / RAG step:

        • Embed the FC text (with e.g. E5/BGE, even without SetFit finetuning).
        • Retrieve nearest training FCs from all BCs (including the tail ones).
        • Inspect their BC labels → suggest the most supported BCs, or present them to a human.

This hybrid “head = classifier, tail = retrieval” pattern is widely used in long-tailed setups. (SpringerLink)
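A rough sketch of Strategy A (df, the threshold, and the retrieval pieces are all placeholders; the fallback assumes you already have some embedding function and a fitted sklearn NearestNeighbors index over the training FC embeddings):

    from collections import Counter

    T = 3  # hypothetical frequency threshold for "head" BCs

    counts = df["BC"].value_counts()
    head_bcs = set(counts[counts >= T].index)

    # Training labels: keep head BCs as-is, map the long tail to one bucket
    df["BC_train"] = df["BC"].where(df["BC"].isin(head_bcs), "BC_OTHER")

    def predict_bc(fc_text, clf, embed, nn_index, train_bc_labels, k=5):
        """clf: classifier over head BCs + BC_OTHER; embed/nn_index: embedding fn + k-NN index."""
        pred = clf.predict([fc_text])[0]
        if pred != "BC_OTHER":
            return pred
        # Tail fallback: retrieve the k most similar training FCs and vote on their true BCs
        ids = nn_index.kneighbors(embed([fc_text]), n_neighbors=k, return_distance=False)[0]
        votes = Counter(train_bc_labels[i] for i in ids)
        return votes.most_common(1)[0][0]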

Strategy B: Don’t fine-tune the encoder; use frozen embeddings + simple classifiers

Because your TF-IDF baseline is very strong already, you can do something simpler than SetFit:

  • Use a pre-trained encoder like E5-base or BGE-base. (GitHub)

  • Do not fine-tune it contrastively on your long-tailed BC dataset.

  • Instead, compute embeddings once and:

    1. Train a logistic regression / linear SVM head on the embeddings, with class weights or cost-sensitive loss if needed. (labelyourdata.com)

    2. Or use a nearest-centroid classifier:

      • For each BC: centroid = average embedding of its FCs.
      • For a new FC: encode, then choose the nearest centroid.
      • This is simple and robust, and it does not require balanced pairs.

Because the encoder is fixed, it is not dominated by the head BC during training. You’re reusing the general semantic space of the pretrained model and just mapping it to your BCs.

This is often more stable than trying to fine-tune with heavily imbalanced SetFit pairs.
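A minimal sketch of the frozen-encoder + nearest-centroid idea with sentence-transformers (the model name is just one example; E5 models would additionally want their “query:”/“passage:” prefixes, and texts/labels are placeholders for your FC strings and BC labels):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # example choice, kept frozen

    # texts: FC input strings, labels: their BC strings
    X = encoder.encode(texts, normalize_embeddings=True)

    # One centroid per BC = mean embedding of its FCs, re-normalized for cosine similarity
    centroid_labels = sorted(set(labels))
    centroids = np.vstack([X[[i for i, l in enumerate(labels) if l == bc]].mean(axis=0)
                           for bc in centroid_labels])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    def predict(new_texts):
        q = encoder.encode(new_texts, normalize_embeddings=True)
        sims = q @ centroids.T  # cosine similarity, since both sides are normalized
        return [centroid_labels[i] for i in sims.argmax(axis=1)]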

Strategy C: Use SetFit only on a smaller, more balanced sub-problem

If you still want to use SetFit in some way:

  • Restrict it to a subset of BCs where you have enough examples (e.g. top 50–100 BCs).
  • Use SetFit to build a very good classifier for these head classes (and maybe “OTHER”).
  • Use different mechanisms (TF-IDF, k-NN, RAG) for the tails.

In all cases, keep that single canonical BC mapping so that label indices match across components.


3. HiClass drops to ~45% because MC/GC are often missing

You noticed:

MC is 52% complete and GC is 58% complete. UNKNOWN_MC and UNKNOWN_GC dominate and break the tree.

This is a classic failure mode of local hierarchical classifiers when many internal levels are “unknown”.

3.1 How HiClass works and why UNKNOWN dominates

HiClass implements several local hierarchical strategies (Local Classifier Per Node, Per Parent Node, Per Level). (arXiv)

In the Local Classifier Per Parent Node design (the one you’re likely using):

  • For each parent node in the tree, HiClass trains a separate multi-class classifier to decide which child to go to. (hiclass.readthedocs.io)
  • E.g. at the BC level, a classifier picks BC1 vs BC2 vs ....
  • At the MC level, a classifier picks among MC children (including any “UNKNOWN_MC”).

Now imagine at some parent node:

  • 60% of the training samples go to "UNKNOWN_MC".
  • The rest are split across many “real” MCs.

The local classifier will naturally learn:

Most of the time, choose "UNKNOWN_MC"; that gives high accuracy at that node.

Once this happens near the top of the path, almost all samples are routed into "UNKNOWN_MC" and never reach the deeper, more specific children. The lower levels become effectively useless for predictions, and your overall tree performance collapses.

This is the behaviour you’re seeing.

3.2 How to treat missing MC/GC (and what you actually need)

Your real business need is:

  • Given FC short + long name, predict BC.

You do not actually need MC and GC to be predicted. They happen to exist in the data, but they are:

  • incomplete (about half missing), and
  • unstable for a local hierarchical training design.

Given that:

Best practical choice right now:

Stop trying to model MC and GC in the hierarchy.
Use a flat BC classifier (the one you already have) as your main model.

You can still use MC/GC as text features:

  • For training examples where MC/GC exist, include them in the input text string that you TF-IDF / embed:

    • "AC: AppLayer | BC: AppSup | MC: EngSync | GC: FuncMon | FC: EngSync_TaskHandler | DESC: Task Activation Handler"
  • For missing MC/GC, you either omit them or write "MC: None" just as text. This does not create a tree node — it is just additional context for the model if available.
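For example, a small helper along these lines (column names follow the example table above; glossary expansion is optional, and BC itself is left out here since it is the target you are predicting at inference time):

    def build_input_text(row, glossary=None):
        """Concatenate available hierarchy fields + FC name + description into one string."""
        parts = []
        for col in ("AC", "MC", "GC"):       # BC is the prediction target, so it is not included
            value = row.get(col)
            if value and value != "-":
                parts.append(f"{col}: {value}")
        parts.append(f"FC: {row['FC']}")
        desc = row.get("FC_Description", "") or ""
        if glossary:                         # naive glossary expansion of abbreviations
            for abbr, full in glossary.items():
                desc = desc.replace(abbr, full)
        parts.append(f"DESC: {desc}")
        return " | ".join(parts)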

If you still want some hierarchy:

  1. Keep only AC → BC in the label path.

    • AC is probably much more complete than MC/GC.
    • Build labels like ["AppLayer", "AppSup"].
    • Train HiClass on this two-level hierarchy only (see the sketch after this list).
    • This is much more stable and closer to your actual use case. (arXiv)
  2. Do not create explicit "UNKNOWN_MC" or "UNKNOWN_GC" nodes.

    • If a path ends at BC, you simply represent it as [AC, BC] without deeper nodes. HiClass can handle paths of different lengths; it pads them internally, so you don’t need to turn missing levels into children. (arXiv)
  3. Decide based on numbers:

    • If AC→BC with HiClass doesn’t beat your flat BC model by a clear margin, stick to the simpler flat model.

This is also consistent with research: when hierarchies are noisy or incomplete, flat classifiers often outperform hierarchical ones. Hierarchical methods help most when the structure is clean and reliable. (Science Direct)
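If you do try the AC → BC-only option from point 1 above, a minimal HiClass sketch could look like this (variable names are placeholders; if your hiclass version struggles with sparse TF-IDF matrices, densify or use embeddings as features):

    from hiclass import LocalClassifierPerParentNode
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    X_train = vectorizer.fit_transform(train_texts)   # train_texts: FC name + description strings

    # Two-level label paths only: [AC, BC]; no UNKNOWN_MC / UNKNOWN_GC nodes are created
    y_train = [[ac, bc] for ac, bc in zip(train_ac, train_bc)]

    clf = LocalClassifierPerParentNode(local_classifier=LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)

    paths = clf.predict(vectorizer.transform(test_texts))
    bc_pred = [path[1] for path in paths]             # level 2 of each predicted path is the BC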


4. Overall recommendation for your project

Given your numbers and constraints, here is a simple, robust plan:

4.1 Main BC classifier

  • Keep your TF-IDF (char 3–5 + word 1–2) + LinearSVC as the main BC model. It’s already performing very well.
  • Use your existing calibration (prefit) to turn scores into approximate probabilities; it’s OK that calibration is imperfect for rare BCs, you can still use it for thresholds. (MachineLearningMastery.com)
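For reference, the setup you describe roughly corresponds to the following scikit-learn sketch (not your exact code; train_texts/train_bcs are placeholders, and the held-out calibration slice simply mirrors your prefit fallback):

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    X_fit, X_cal, y_fit, y_cal = train_test_split(train_texts, train_bcs, test_size=0.15, random_state=0)

    base = Pipeline([
        ("tfidf", FeatureUnion([
            ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
            ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ])),
        ("svc", LinearSVC(class_weight="balanced")),
    ])
    base.fit(X_fit, y_fit)

    # Prefit calibration on the held-out slice, because CV folds cannot cover the rare BCs;
    # very rare BCs may be missing from the calibration slice, which is the known limitation
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
    calibrated.fit(X_cal, y_cal)

    probs = calibrated.predict_proba(X_cal)  # approximate probabilities for thresholds / Top-k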

4.2 Semantic backup / RAG

  • Use a frozen encoder like E5-base or BGE-base to embed FC texts (no SetFit fine-tuning across all 600 BCs). (GitHub)

  • Build:

    • A nearest-centroid / k-NN classifier over those embeddings, and/or
    • A retrieval index (BM25 + FAISS) to retrieve similar FCs and BC cards.

Then:

  • If TF-IDF is confident → trust TF-IDF.

  • If TF-IDF is low-confidence → consult the k-NN / RAG neighbours to:

    • refine the BC prediction, or
    • present top-k BC candidates + similar FC examples to a human.
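A sketch of that gating logic (the threshold is something to tune on validation data; calibrated is the calibrated TF-IDF model from 4.1, and knn_predict stands in for whatever embedding-based neighbour lookup you build, so both names are assumptions):

    import numpy as np

    CONF_THRESHOLD = 0.6  # tune on validation data

    def predict_with_fallback(fc_text):
        probs = calibrated.predict_proba([fc_text])[0]
        top_idx = np.argsort(probs)[::-1][:3]
        top_bcs = [(calibrated.classes_[i], float(probs[i])) for i in top_idx]

        if top_bcs[0][1] >= CONF_THRESHOLD:
            return {"bc": top_bcs[0][0], "source": "tfidf", "candidates": top_bcs}

        # Low confidence: consult embedding neighbours (k-NN / RAG) for the final call
        neighbour_bcs = knn_predict(fc_text, k=5)  # hypothetical helper returning neighbour BCs
        return {"bc": neighbour_bcs[0], "source": "knn", "candidates": top_bcs}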

4.3 SetFit usage

  • Use SetFit only on smaller, more balanced sub-problems, for example:

    • head BCs only; or
    • grouping BCs into 10 “super-BC” families.

For SetFit runs:

  • Always use one canonical BC mapping so label indices match.
  • Balance the training dataset using SetFit’s sampling utilities (same number of examples per label where possible). (GitHub)
  • Log prediction collapse metrics (majority prediction fraction, # of zero-recall labels) as part of your evaluation.
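If you do run SetFit on such a head-only subset, a sketch could look like the following (assuming a recent setfit release; older versions use SetFitTrainer instead of Trainer/TrainingArguments, and head_df is a placeholder DataFrame with "text" and "label" columns for head BCs only):

    from datasets import Dataset
    from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

    dataset = Dataset.from_pandas(head_df[["text", "label"]])

    # Draw the same number of examples per label so contrastive pairs are roughly balanced
    train_ds = sample_dataset(dataset, label_column="label", num_samples=8)

    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    args = TrainingArguments(batch_size=16, num_epochs=1)

    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()

    preds = model.predict(["FuncMonBrake_System Brake System Safety Monitor"])

After training, run the same collapse checks from section 1.2 on a validation split before trusting the model.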

4.4 HiClass usage

  • Do not model MC/GC in the hierarchy right now.

  • If you want a hierarchical model:

    • Use only AC→BC in the label tree.
    • Compare with the flat baseline.
    • If the gain is small or negative, keep the flat model and use hierarchy only as metadata.

5. A short mental checklist you can reuse

When you see “model collapses to one class”:

  1. Check label counts → are there many classes with 1–3 samples?

  2. Check predictions → what fraction of examples get the majority predicted BC?

  3. Check per-class recall → how many labels never get predicted?

  4. If yes to all:

    • reconsider training method for such long-tailed data (e.g., collapse tails, use retrieval, keep encoder frozen, etc.). (CVF Open Access)

You are already on a very good path. The main work now is not to “find a magical new model,” but to stabilize the system around your strong baseline, handle the long tail pragmatically, and use RAG/nearest-neighbour for explanation and tail BC support.

