How to train a text-based classifier to predict parent module (BC) from functional component names and simple descriptions?

Hi everyone,

I am working on a classification problem in the automotive domain.
The goal is to predict which Base Component (BC) a new Functional Component (FC) belongs to.

Each FC has:

  • a short name (abbreviated identifier)

  • a long name (short English description)

  • known hierarchy fields like AC, MC, GC, etc. (some may be empty)

The task:

Given the FC short name and its long name, train a model that predicts the BC it belongs to.

My training data has a hierarchical structure: AC → BC → MC → GC → FC, but for prediction I only have access to the FC name and description.

Additional Context:

  • I have a glossary mapping abbreviations to full terms (e.g., “Accr” → “Accelerator”, “MoF” → “Function Monitoring”)

  • Technical domain: Automotive software architecture

  • The naming follows specific conventions with underscores, prefixes, and technical abbreviations
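To make the glossary idea concrete, here is a minimal sketch of the kind of preprocessing I have in mind: split the short name on underscores and CamelCase boundaries, expand any abbreviation found in the glossary, and join the result with the long name. The function names, the `|` separator, and the identifier `MoF_AccrBrk` are made up for illustration, and the glossary holds only the two entries quoted above.

```python
import re

# Hypothetical glossary excerpt: only the two entries quoted in this post.
GLOSSARY = {"Accr": "Accelerator", "MoF": "Function Monitoring"}

# Matches acronym runs, capitalized words, lowercase runs, and digit runs.
CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")

def expand_name(name, glossary=GLOSSARY):
    """Split an FC short name into words and expand known abbreviations."""
    words = []
    for part in name.split("_"):
        # Look up the whole underscore-separated part first, so that mixed-case
        # abbreviations such as "MoF" are not broken up by the CamelCase split.
        pieces = [part] if part in glossary else CAMEL.findall(part)
        for piece in pieces:
            words.append(glossary.get(piece, piece))
    return " ".join(words).lower()

def build_input(short_name, long_name, glossary=GLOSSARY):
    """Combine the expanded short name and the long name into one text input."""
    return f"{expand_name(short_name, glossary)} | {long_name.lower()}"

print(build_input("MoF_AccrBrk", "Accelerator Brake Check"))
```

Abbreviations that are not in the glossary (like `Brk` here) are kept as they are, so the model can still pick up on them as raw tokens.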

Questions:

  1. What’s the best approach for this hierarchical classification with mixed naming patterns?

  2. Should I use transformer models or traditional ML with feature engineering?

  3. Any experience with similar technical domain classification problems?

Dataset Structure

AC → Top-level application layer
├── BC → Main functional subsystem [TARGET LABEL]
├── MC → Optional submodule layer
├── GC → Optional function grouping
└── FC (Functional Component) → Specific function [INPUT]

An example of what my data looks like (illustrative only, this is not my real data):

| AC       | BC     | MC       | GC       | FC                    | FC_Description                       |
|----------|--------|----------|----------|-----------------------|--------------------------------------|
| AppLayer | -      | -        | -        | -                     | Application Layer                    |
| AppLayer | AppSup | -        | -        | -                     | Application Supervisor               |
| AppLayer | AppSup | EngSync  | -        | -                     | Engine Synchronization Controller    |
| AppLayer | AppSup | EngSync  | -        | EngSync_Adapter       | Software Adapter Component           |
| AppLayer | AppSup | EngSync  | -        | EngSync_CamEvtCfg     | Camshaft Event Configuration Module  |
| AppLayer | AppSup | EngSync  | -        | EngSync_Monitor       | Engine Sync Controller Monitoring    |
| AppLayer | AppSup | EngSync  | -        | EngSync_TaskHandler   | Task Activation Handler              |
| AppLayer | AppSup | HiSvc    | -        | -                     | High-Level Service Library           |
| AppLayer | AppSup | HiSvc    | -        | HiSvc_Library         | High-Level Service Library           |
| AppLayer | AppSup | HiSvc    | SecComm  | SecComm_Adapter       | Secure Communication Adapter         |
| AppLayer | AppSup | PwrMod   | -        | -                     | Power Mode Manager                   |
| AppLayer | AppSup | PwrMod   | PwrCoord | -                     | Power Mode Coordinator Functions     |
| AppLayer | AppSup | PwrMod   | PwrCoord | PwrCoord_Periph       | Power Mode Coordinator Peripherals   |
| AppLayer | AppSup | SafeFunc | FuncMon  | -                     | Function Monitoring                  |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonAccel_BrkCheck | Accelerator Brake Plausibility Check |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonAir_AddFunc    | Additional Air Monitoring Function   |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonAir_Charge     | Relative Air Charge Monitoring       |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonBrake          | Brake System Monitoring              |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonBrake_Hardware | Brake Hardware Input Monitoring      |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonBrake_System   | Brake System Safety Monitor          |
| AppLayer | AppSup | SafeFunc | FuncMon  | FuncMonBase_Speed     | Speed Signal Monitoring              |

  1. MC and GC columns can often be empty.
  2. Some FCs start with their BC name, others start with GC or MC identifiers, and a few start with completely unrelated prefixes.
  3. FC long names describe the functionality.
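Since many rows describe a hierarchy level rather than an actual FC, I assume the first step is to turn the table into (text, label) training pairs. A rough sketch, assuming each row is a tuple in the column order above with `-` marking an empty field (the function name `to_training_pairs` is just a placeholder):

```python
# A few rows copied from the example table above; "-" marks an empty field.
ROWS = [
    ("AppLayer", "AppSup", "EngSync", "-", "-", "Engine Synchronization Controller"),
    ("AppLayer", "AppSup", "EngSync", "-", "EngSync_Adapter", "Software Adapter Component"),
    ("AppLayer", "AppSup", "SafeFunc", "FuncMon", "FuncMonBrake", "Brake System Monitoring"),
]

def to_training_pairs(rows, empty="-"):
    """Keep only rows that have both an FC and a BC; return (input text, BC) pairs."""
    pairs = []
    for ac, bc, mc, gc, fc, description in rows:
        if fc == empty or bc == empty:
            continue  # hierarchy-only row, nothing to classify
        # MC and GC are deliberately left out: they are not available at prediction time.
        pairs.append((f"{fc} {description}", bc))
    return pairs

for text, label in to_training_pairs(ROWS):
    print(label, "<-", text)
```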

The FC names follow different patterns:

  1. Direct BC naming: the FC name starts with the BC name

  2. MC-based naming: the FC name starts with the MC name

  3. GC-based naming: the FC name starts with the GC name

  4. Mixed patterns: the FC name starts with a completely different identifier
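To see how much signal the prefix alone carries, I could tag every training row with the pattern it follows and count them. A small sketch of that check, assuming `-` marks an empty field; the function names are placeholders, and checking GC before MC before BC is my own choice so that the most specific level wins:

```python
from collections import Counter

def naming_pattern(fc, bc, mc, gc, empty="-"):
    """Return the hierarchy level whose name the FC short name starts with."""
    # Most specific level first, so a GC prefix wins over an MC or BC prefix.
    for level, value in (("GC", gc), ("MC", mc), ("BC", bc)):
        if value != empty and fc.startswith(value):
            return level
    return "other"

def pattern_counts(rows):
    """rows: (bc, mc, gc, fc) tuples; count how often each pattern occurs."""
    return Counter(naming_pattern(fc, bc, mc, gc) for bc, mc, gc, fc in rows)

# Rows taken from the example table (BC, MC, GC, FC).
SAMPLE = [
    ("AppSup", "EngSync", "-", "EngSync_Adapter"),
    ("AppSup", "HiSvc", "-", "HiSvc_Library"),
    ("AppSup", "HiSvc", "SecComm", "SecComm_Adapter"),
    ("AppSup", "SafeFunc", "FuncMon", "FuncMonBrake"),
]

print(pattern_counts(SAMPLE))
```

If most rows fall into the BC, MC, or GC buckets, a lookup from known prefixes to BC might already be a strong baseline, and the text model would mainly matter for the "other" bucket.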

Available Resources:

  • The dataset includes the following columns: AC, BC, MC, GC, FC, and FC task — where FC is the structured short identifier, FC task is its descriptive English text, and BC is the target label to be predicted.

  • A complete glossary mapping abbreviations to full terms.

I’m currently unsure what preprocessing or embedding strategy would work best to identify the correct BC for a new FC. Could anyone please guide me on the right process or steps to follow for this type of structured + text-based classification problem?
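As a starting point I was thinking of a very simple baseline before trying transformers: character n-grams over "short name + long name", and a vote among the most similar training examples. Below is a dependency-free sketch of that idea, not a tuned model. The BC `EngCtl` and the `FuelInj_*` rows are invented so that the toy data has more than one class; the n-gram range and `k` are arbitrary.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of the lowercased text, padded with spaces."""
    text = f" {text.lower()} "
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_bc(query, train, k=3):
    """train: list of (text, bc). Similarity-weighted vote among the k nearest texts."""
    q = char_ngrams(query)
    scored = sorted(((cosine(q, char_ngrams(text)), bc) for text, bc in train), reverse=True)
    votes = Counter()
    for score, bc in scored[:k]:
        votes[bc] += score
    return votes.most_common(1)[0][0]

# Toy training set; "EngCtl" and the FuelInj rows are invented for illustration.
TRAIN = [
    ("EngSync_Adapter Software Adapter Component", "AppSup"),
    ("EngSync_Monitor Engine Sync Controller Monitoring", "AppSup"),
    ("FuelInj_Timing Fuel Injection Timing Control", "EngCtl"),
    ("FuelInj_Pressure Fuel Rail Pressure Control", "EngCtl"),
]

print(predict_bc("EngSync_TaskHandler Task Activation Handler", TRAIN))
```

Character n-grams seem attractive here because they pick up shared prefixes such as `EngSync_` without any tokenizer that understands the naming conventions.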

After this part, my mentor also wants me to build a RAG (Retrieval-Augmented Generation) pipeline on top of it. If you could suggest a pipeline or architecture that would fit this kind of dataset (even a simple one), that would be amazing. I am lost, so any pointers or example workflows would really help me move forward.
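My rough understanding of the retrieval half of such a pipeline is: find the known FCs most similar to the new one, then hand them to a language model as context. Here is a minimal sketch of just that step, using token-overlap (Jaccard) similarity as a stand-in for a real embedding search. The record layout, the function names, the prompt wording, and the new FC `FuncMonBrake_Pedal` are all assumptions, and no model is actually called.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, records, k=2):
    """Return the k records whose FC name + description overlap most with the query."""
    q = tokens(query)

    def score(record):
        t = tokens(f"{record['fc']} {record['desc']}")
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(records, key=score, reverse=True)[:k]

def build_prompt(fc, desc, records, k=2):
    """Assemble a prompt that shows the retrieved examples before the new FC."""
    examples = retrieve(f"{fc} {desc}", records, k)
    lines = [f"- FC: {r['fc']} | Description: {r['desc']} | BC: {r['bc']}" for r in examples]
    return (
        "Known components:\n" + "\n".join(lines)
        + f"\n\nNew component:\nFC: {fc} | Description: {desc}\n"
        + "Which BC does it belong to?"
    )

# Records taken from the example table above.
RECORDS = [
    {"fc": "EngSync_Monitor", "desc": "Engine Sync Controller Monitoring", "bc": "AppSup"},
    {"fc": "FuncMonBrake_Hardware", "desc": "Brake Hardware Input Monitoring", "bc": "AppSup"},
    {"fc": "PwrCoord_Periph", "desc": "Power Mode Coordinator Peripherals", "bc": "AppSup"},
]

print(build_prompt("FuncMonBrake_Pedal", "Brake Pedal Monitoring", RECORDS))
```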


Hmm, not really familiar with this topic. Cases where you’d use SetFit…?
