Hi everyone,
I am working on a classification problem in the automotive domain.
The goal is to predict which Base Component (BC) a new Functional Component (FC) belongs to.
Each FC has:
- a short name (abbreviated identifier)
- a long name (short English description)
- known hierarchy fields like AC, MC, GC, etc. (some may be empty)
The task:
Given the FC short name and its long name as input, train a model that predicts the BC the FC belongs to.
My training data has a hierarchical structure: AC → BC → MC → GC → FC, but for prediction I only have access to the FC name and description.
Additional Context:
- I have a glossary mapping abbreviations to full terms (e.g., “Accr” → “Accelerator”, “MoF” → “Function Monitoring”)
- Technical domain: Automotive software architecture
- The naming follows specific conventions with underscores, prefixes, and technical abbreviations
Questions:
- What’s the best approach for this hierarchical classification with mixed naming patterns?
- Should I use transformer models or traditional ML with feature engineering?
- Any experience with similar technical domain classification problems?
Dataset Structure
AC → Top-level application layer
├── BC → Main functional subsystem [TARGET LABEL]
├── MC → Optional submodule layer
├── GC → Optional function grouping
└── FC (Functional Component) → Specific function [INPUT]
Here is an example of what my data looks like (this is not my real data):
| AC | BC | MC | GC | FC | FC_Description |
|---|---|---|---|---|---|
| AppLayer | - | - | - | - | Application Layer |
| AppLayer | AppSup | - | - | - | Application Supervisor |
| AppLayer | AppSup | EngSync | - | - | Engine Synchronization Controller |
| AppLayer | AppSup | EngSync | - | EngSync_Adapter | Software Adapter Component |
| AppLayer | AppSup | EngSync | - | EngSync_CamEvtCfg | Camshaft Event Configuration Module |
| AppLayer | AppSup | EngSync | - | EngSync_Monitor | Engine Sync Controller Monitoring |
| AppLayer | AppSup | EngSync | - | EngSync_TaskHandler | Task Activation Handler |
| AppLayer | AppSup | HiSvc | - | - | High-Level Service Library |
| AppLayer | AppSup | HiSvc | - | HiSvc_Library | High-Level Service Library |
| AppLayer | AppSup | HiSvc | SecComm | SecComm_Adapter | Secure Communication Adapter |
| AppLayer | AppSup | PwrMod | - | - | Power Mode Manager |
| AppLayer | AppSup | PwrMod | PwrCoord | - | Power Mode Coordinator Functions |
| AppLayer | AppSup | PwrMod | PwrCoord | PwrCoord_Periph | Power Mode Coordinator Peripherals |
| AppLayer | AppSup | SafeFunc | FuncMon | - | Function Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonAccel_BrkCheck | Accelerator Brake Plausibility Check |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonAir_AddFunc | Additional Air Monitoring Function |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonAir_Charge | Relative Air Charge Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBrake | Brake System Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBrake_Hardware | Brake Hardware Input Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBrake_System | Brake System Safety Monitor |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBase_Speed | Speed Signal Monitoring |
- MC and GC columns can often be empty.
- Some FCs start with their BC name, others start with GC or MC identifiers, and a few start with completely unrelated prefixes.
- FC long names describe the functionality.
The FC names follow different patterns:
- Direct BC naming: starts directly with the BC name
- MC-based naming: starts directly with the MC name
- GC-based naming: starts with the GC name
- Mixed patterns: the FC name starts with a completely different identifier
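To get a feel for how common each pattern is, something like the following could be run over the training rows, where BC, MC and GC are known. This is only a rough sketch: `naming_pattern` is a name I made up, and it assumes empty fields are stored as "-" or an empty string. It cannot be used as a model feature directly, because at prediction time the hierarchy fields are not available.

```python
def naming_pattern(fc: str, bc: str, mc: str, gc: str) -> str:
    """Return the hierarchy level whose name the FC short name starts with, or "other"."""
    # Check the most specific level first; "-" or "" marks an empty field.
    for level, value in (("GC", gc), ("MC", mc), ("BC", bc)):
        if value not in ("", "-") and fc.startswith(value):
            return level
    return "other"


# Rows taken from the example table in this post.
print(naming_pattern("EngSync_Adapter", "AppSup", "EngSync", "-"))
print(naming_pattern("SecComm_Adapter", "AppSup", "HiSvc", "SecComm"))
```

Counting the returned values per BC would show which BCs are mostly recoverable from the prefix alone and which ones depend on the long name.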
Available Resources:
- The dataset includes the following columns: AC, BC, MC, GC, FC, and FC task. FC is the structured short identifier, FC task is its descriptive English text (the FC_Description column in the example above), and BC is the target label to be predicted.
- A complete glossary mapping abbreviations to full terms.
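To show how I imagine using the glossary, here is a minimal sketch that splits an FC short name and expands known abbreviations before joining it with the long name. It assumes names split cleanly on underscores and capital letters, which my real data may not always do. The function names (`split_identifier`, `expand_name`, `build_input`) are placeholders I made up, and the two glossary entries are just the examples mentioned earlier in this post.

```python
import re

# Two entries copied from the examples in this post; the real glossary
# would be loaded from the full mapping file.
GLOSSARY = {"Accr": "Accelerator", "MoF": "Function Monitoring"}


def split_identifier(part: str) -> list:
    """Split one underscore-free chunk on case changes, e.g. "CamEvtCfg" -> ["Cam", "Evt", "Cfg"]."""
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", part)


def expand_name(fc: str, glossary: dict) -> str:
    """Expand an FC short name into words, replacing glossary abbreviations."""
    words = []
    for part in fc.split("_"):
        if part in glossary:
            # Whole chunk is an abbreviation (e.g. "MoF"); do not split it further.
            words.append(glossary[part])
            continue
        for token in split_identifier(part):
            words.append(glossary.get(token, token))
    return " ".join(words)


def build_input(fc: str, description: str, glossary: dict) -> str:
    """Concatenate raw name, expanded name and long name into one model input string."""
    return f"{fc} {expand_name(fc, glossary)} {description}"


print(build_input("EngSync_CamEvtCfg", "Camshaft Event Configuration Module", GLOSSARY))
```

Keeping the raw identifier in the input next to the expanded words is a deliberate choice here, so a model can still pick up on prefixes that the glossary does not cover.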
I’m currently unsure what preprocessing or embedding strategy would work best to identify the correct BC for a new FC. Could anyone please guide me on the right process or steps to follow for this type of structured + text-based classification problem?
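As a starting point, this is the kind of simple baseline I could imagine: compare character trigrams of the new FC text with every training row and let the closest rows vote on the BC. It is only a sketch in plain Python, not something I have validated. The names `char_ngrams`, `cosine` and `predict_bc` are my own, and the labels `BC_A` and `BC_B` are invented placeholders, because my example table only contains a single BC.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_bc(query: str, train: list, k: int = 3) -> str:
    """train is a non-empty list of (text, bc) pairs; the k nearest rows vote, weighted by similarity."""
    q = char_ngrams(query)
    scored = sorted(((cosine(q, char_ngrams(text)), bc) for text, bc in train), reverse=True)
    votes = Counter()
    for score, bc in scored[:k]:
        votes[bc] += score
    return votes.most_common(1)[0][0]


# Texts come from the example table; the BC labels are made-up placeholders.
train = [
    ("EngSync_Adapter Software Adapter Component", "BC_A"),
    ("EngSync_Monitor Engine Sync Controller Monitoring", "BC_A"),
    ("FuncMonBrake Brake System Monitoring", "BC_B"),
    ("FuncMonBrake_Hardware Brake Hardware Input Monitoring", "BC_B"),
]
print(predict_bc("FuncMonBrake_System Brake System Safety Monitor", train))
```

Character n-grams are used instead of whole words because the short names are abbreviations, so partial overlaps like "FuncMon" still count. A TF-IDF vectorizer with a linear classifier, or a fine-tuned transformer, would be the obvious next things to compare against this.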
After this part, my mentor also wants me to build a RAG (Retrieval-Augmented Generation) pipeline on top of it, so if you could also suggest a pipeline or architecture that would fit this kind of dataset (even a simple one), that would be amazing. I am lost, and any pointers or example workflows would really help me move forward.
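My current (possibly naive) understanding of the RAG part is: retrieve the most similar known FCs, put them into a prompt together with the new FC, and let a language model name the BC. The sketch below covers only the retrieval and prompt-building steps, using `difflib` from the standard library as a stand-in for a real embedding search. The functions `retrieve` and `build_prompt`, the record layout, and the `BC_A`/`BC_B` labels are all assumptions of mine, and the actual LLM call is left out.

```python
from difflib import SequenceMatcher


def retrieve(query: str, records: list, k: int = 3) -> list:
    """Return the k records whose "text" is most similar to the query string."""
    def score(rec: dict) -> float:
        return SequenceMatcher(None, query.lower(), rec["text"].lower()).ratio()
    return sorted(records, key=score, reverse=True)[:k]


def build_prompt(fc: str, description: str, records: list, k: int = 3) -> str:
    """Build a few-shot style prompt from the k nearest known FCs."""
    query = f"{fc} {description}"
    lines = [f"- FC: {r['text']} -> BC: {r['bc']}" for r in retrieve(query, records, k)]
    return (
        "Known functional components and their base components:\n"
        + "\n".join(lines)
        + f"\n\nNew FC: {query}\n"
        + "Which BC does it belong to? Answer with the BC name only."
    )


# Texts come from the example table; the BC labels are made-up placeholders.
records = [
    {"text": "EngSync_Adapter Software Adapter Component", "bc": "BC_A"},
    {"text": "EngSync_Monitor Engine Sync Controller Monitoring", "bc": "BC_A"},
    {"text": "FuncMonBrake Brake System Monitoring", "bc": "BC_B"},
    {"text": "FuncMonBrake_Hardware Brake Hardware Input Monitoring", "bc": "BC_B"},
]
print(build_prompt("FuncMonBrake_System", "Brake System Safety Monitor", records, k=2))
```

In a real pipeline I assume the string similarity would be replaced by a vector index over embeddings of the expanded names and descriptions, but the overall shape (retrieve, build prompt, ask the model) would stay the same.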