Hi everyone,
I am working on a classification problem in the automotive domain.
The goal is to predict which Base Component (BC) a new Functional Component (FC) belongs to.
Each FC has:
- a short name (abbreviated identifier)
- a long name (short English description)
- known hierarchy fields like AC, MC, GC, etc. (some may be empty)
 
The task:
Given the FC short name and long name as input, train a model to predict the BC it belongs to.
My training data has a hierarchical structure: AC → BC → MC → GC → FC, but for prediction I only have access to the FC name and description.
Additional Context:
- I have a glossary mapping abbreviations to full terms (e.g., “Accr” → “Accelerator”, “MoF” → “Function Monitoring”)
- Technical domain: Automotive software architecture
- The naming follows specific conventions with underscores, prefixes, and technical abbreviations
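
To make the glossary idea concrete, here is a rough sketch of the preprocessing I have in mind: split the FC short name on underscores and camel-case boundaries, then replace any token found in the glossary with its full term. The two glossary entries are the ones quoted above; the function names, the regex, and the example name `MoFAccr_BrkCheck` are only illustrative assumptions, not part of my real pipeline.

```python
import re

# Only the two entries quoted in this post; the real glossary is much larger.
GLOSSARY = {"Accr": "Accelerator", "MoF": "Function Monitoring"}

# Fallback tokenizer: a capitalised chunk, a lowercase/digit run, or any single char.
TOKEN = re.compile(r"[A-Z][a-z0-9]*|[a-z0-9]+|.")


def split_name(fc_name, glossary=GLOSSARY):
    """Split an FC short name into tokens, preferring known glossary keys."""
    keys = sorted(glossary, key=len, reverse=True)  # try the longest key first
    tokens = []
    for part in fc_name.split("_"):
        i = 0
        while i < len(part):
            key = next((k for k in keys if part.startswith(k, i)), None)
            if key is not None:
                tokens.append(key)
                i += len(key)
            else:
                match = TOKEN.match(part, i)
                tokens.append(match.group())
                i = match.end()
    return tokens


def expand(fc_name, glossary=GLOSSARY):
    """Return a lowercase text version of the name with abbreviations expanded."""
    return " ".join(glossary.get(t, t) for t in split_name(fc_name, glossary)).lower()


if __name__ == "__main__":
    print(split_name("EngSync_CamEvtCfg"))
    print(expand("MoFAccr_BrkCheck"))  # hypothetical name, for illustration only
```

One caveat I can already see: a short glossary key can wrongly match the start of a longer word, so the matching probably needs a minimum key length or a whitelist.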
 
Questions:
- What’s the best approach for this hierarchical classification with mixed naming patterns?
- Should I use transformer models or traditional ML with feature engineering?
- Any experience with similar technical domain classification problems?
 
Dataset Structure:
AC → Top-level application layer
├── BC → Main functional subsystem [TARGET LABEL]
├── MC → Optional submodule layer
├── GC → Optional function grouping
└── FC (Functional Component) → Specific function [INPUT]
Here is an example of what my data looks like (this is not my real data):
| AC | BC | MC | GC | FC | FC_Description |
|---|---|---|---|---|---|
| AppLayer | - | - | - | - | Application Layer |
| AppLayer | AppSup | - | - | - | Application Supervisor |
| AppLayer | AppSup | EngSync | - | - | Engine Synchronization Controller |
| AppLayer | AppSup | EngSync | - | EngSync_Adapter | Software Adapter Component |
| AppLayer | AppSup | EngSync | - | EngSync_CamEvtCfg | Camshaft Event Configuration Module |
| AppLayer | AppSup | EngSync | - | EngSync_Monitor | Engine Sync Controller Monitoring |
| AppLayer | AppSup | EngSync | - | EngSync_TaskHandler | Task Activation Handler |
| AppLayer | AppSup | HiSvc | - | - | High-Level Service Library |
| AppLayer | AppSup | HiSvc | - | HiSvc_Library | High-Level Service Library |
| AppLayer | AppSup | HiSvc | SecComm | SecComm_Adapter | Secure Communication Adapter |
| AppLayer | AppSup | PwrMod | - | - | Power Mode Manager |
| AppLayer | AppSup | PwrMod | PwrCoord | - | Power Mode Coordinator Functions |
| AppLayer | AppSup | PwrMod | PwrCoord | PwrCoord_Periph | Power Mode Coordinator Peripherals |
| AppLayer | AppSup | SafeFunc | FuncMon | - | Function Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonAccel_BrkCheck | Accelerator Brake Plausibility Check |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonAir_AddFunc | Additional Air Monitoring Function |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonAir_Charge | Relative Air Charge Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBrake | Brake System Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBrake_Hardware | Brake Hardware Input Monitoring |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBrake_System | Brake System Safety Monitor |
| AppLayer | AppSup | SafeFunc | FuncMon | FuncMonBase_Speed | Speed Signal Monitoring |

- MC and GC columns can often be empty.
- Some FCs start with their BC name, others start with GC or MC identifiers, and a few start with completely unrelated prefixes.
- FC long names describe functionality.
 
The FC names follow different patterns:
- Direct BC naming: the FC name starts with the BC name
- MC-based naming: the FC name starts with the MC name
- GC-based naming: the FC name starts with the GC name
- Mixed patterns: the FC name starts with a completely different identifier
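
To check how common each pattern is, I was thinking of labelling every training row by which hierarchy level its FC name starts with. This is only a sketch: the rows are copied from the example data in this post, `-` is assumed to mark an empty field, and `prefix_pattern` is a helper name I made up.

```python
from collections import Counter

EMPTY = "-"  # assumption: empty hierarchy fields are stored as "-"

# A few rows taken from the example data in this post.
ROWS = [
    {"BC": "AppSup", "MC": "EngSync", "GC": "-", "FC": "EngSync_Adapter"},
    {"BC": "AppSup", "MC": "HiSvc", "GC": "SecComm", "FC": "SecComm_Adapter"},
    {"BC": "AppSup", "MC": "SafeFunc", "GC": "FuncMon", "FC": "FuncMonBrake_System"},
    {"BC": "AppSup", "MC": "PwrMod", "GC": "PwrCoord", "FC": "PwrCoord_Periph"},
]


def prefix_pattern(row):
    """Return the first hierarchy level (BC, MC, GC) whose name prefixes the FC name."""
    for level in ("BC", "MC", "GC"):
        value = row.get(level, EMPTY)
        if value and value != EMPTY and row["FC"].startswith(value):
            return level
    return "other"  # the "mixed patterns" case: an unrelated prefix


if __name__ == "__main__":
    counts = Counter(prefix_pattern(row) for row in ROWS)
    for pattern, n in counts.most_common():
        print(pattern, n)
```

If most rows fall into the MC or GC buckets, a simple prefix lookup table might already recover the BC for many names, and a learned model would only be needed for the "other" bucket.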
 
Available Resources:
- The dataset includes the following columns: AC, BC, MC, GC, FC, and FC task (shown as FC_Description in the example). FC is the structured short identifier, FC task is its descriptive English text, and BC is the target label to be predicted.
- A complete glossary mapping abbreviations to full terms.
 
I’m currently unsure what preprocessing or embedding strategy would work best to identify the correct BC for a new FC. Could anyone please guide me on the right process or steps to follow for this type of structured + text-based classification problem?
After this part, my mentor also wants me to build a RAG (Retrieval-Augmented Generation) pipeline on top of it. If you could also suggest a pipeline or architecture that would fit this kind of dataset (even a simple one), that would be amazing. I am lost, and any pointers or example workflows would really help me move forward.
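
For reference, the simplest retrieval step I can picture looks like the sketch below: find the known FCs most similar to a new one and read the BC off the neighbours. Character-trigram Jaccard similarity is just a dependency-free stand-in for a proper embedding model, the rows are from the example data in this post (so every BC is AppSup), and the query is a made-up FC. I have no idea yet whether this is a sensible base for RAG.

```python
from collections import Counter

# Known FCs from the example data in this post: (FC, description, BC).
KNOWN = [
    ("EngSync_Monitor", "Engine Sync Controller Monitoring", "AppSup"),
    ("SecComm_Adapter", "Secure Communication Adapter", "AppSup"),
    ("PwrCoord_Periph", "Power Mode Coordinator Peripherals", "AppSup"),
    ("FuncMonAir_Charge", "Relative Air Charge Monitoring", "AppSup"),
    ("FuncMonBrake", "Brake System Monitoring", "AppSup"),
    ("FuncMonBrake_Hardware", "Brake Hardware Input Monitoring", "AppSup"),
]


def trigrams(text):
    """Set of lowercase character trigrams, padded so short words still count."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(fc_name, description, known=KNOWN, k=3):
    """Return the k known rows most similar to the new FC name + description."""
    query = trigrams(f"{fc_name} {description}")
    scored = sorted(
        known,
        key=lambda row: jaccard(query, trigrams(f"{row[0]} {row[1]}")),
        reverse=True,
    )
    return scored[:k]


def predict_bc(fc_name, description, known=KNOWN, k=3):
    """Majority vote over the BC labels of the retrieved neighbours."""
    votes = Counter(row[2] for row in retrieve(fc_name, description, known, k))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical new FC, not from my data.
    for row in retrieve("FuncMonBrake_Sensor", "Brake Sensor Monitoring"):
        print(row)
    print(predict_bc("FuncMonBrake_Sensor", "Brake Sensor Monitoring"))
```

In a real RAG setup I assume the retrieved rows would be passed to the generator as context instead of being majority-voted.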