I have tried to build a BERT job title classifier based on some other examples. The Colab notebook can be viewed here:
https://colab.research.google.com/drive/1EXKrNIIaVLR9cnpysJCS-NnciksgMoZC?usp=sharing
When I train the model on a small dataset of 1K entries, such as
https://drive.google.com/uc?export=download&id=1Q29nC9Y1x6QGwSiKfu2ie45RG_p-BQ7k
I get the expected results from my small test case:
programmer: Information & Communication Technology - Engineering - Software
builder: Manufacturing, Transport & Logistics - Machine Operators
mechanic: Trades & Services - Automotive Trades
office assistant: Administration & Office Support - Administrative Assistants
welder: Trades & Services - Welders & Boilermakers
However, if I train the model on a larger dataset containing 10K entries, such as
https://drive.google.com/uc?export=download&id=1eFMpgiSmhsNoZCBPrwlJlPZ6Lp_Bqw2Y
my small test case gives incorrect results, with every title assigned the same class:
programmer: Administration & Office Support - Administrative Assistants
builder: Administration & Office Support - Administrative Assistants
mechanic: Administration & Office Support - Administrative Assistants
office assistant: Administration & Office Support - Administrative Assistants
welder: Administration & Office Support - Administrative Assistants
So at this stage I am unsure whether the issue is in the code or in the dataset.
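One check on the dataset side is to look at how the labels are distributed, since a model that predicts a single class for every input can be a sign that one class dominates the training data (it can also have other causes, such as training settings). Below is a minimal sketch of that check. The label values are made up for illustration and are not taken from the 10K file; in practice the list would come from the sub_classification_id column.

```python
from collections import Counter

def label_distribution(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]

# Hypothetical labels standing in for the sub_classification_id column.
labels = (
    ["admin_assistants"] * 6
    + ["software"] * 2
    + ["welders"] * 1
    + ["automotive"] * 1
)

for label, n, share in label_distribution(labels):
    print(f"{label}: {n} ({share:.0%})")
```

If one label accounts for a large share of the rows, the dataset is a likely suspect; if the labels are roughly balanced, the code or the training settings are more likely to be at fault.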
I am currently training the model using the sub_classification_id. Should I instead train the model on the parent classification_id, then perform further analysis to determine the sub_classification_id?
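To make the two-stage idea concrete, here is a rough sketch of what I mean: one model predicts the parent classification, and a separate model per parent then predicts the sub-classification. The function name predict_hierarchical, the toy keyword models, and the numeric IDs are all hypothetical placeholders; the real version would use fine-tuned BERT classifiers in their place.

```python
from typing import Callable, Dict, Tuple

# A classifier is anything that maps a job title to a class ID.
Classifier = Callable[[str], int]

def predict_hierarchical(
    title: str,
    parent_model: Classifier,
    sub_models: Dict[int, Classifier],
) -> Tuple[int, int]:
    """Predict the parent class first, then the sub-class within that parent."""
    parent_id = parent_model(title)
    sub_id = sub_models[parent_id](title)
    return parent_id, sub_id

# Toy stand-ins for trained models (hypothetical IDs, keyword rules only).
def toy_parent(title: str) -> int:
    return 1 if "assistant" in title else 2

toy_subs: Dict[int, Classifier] = {
    1: lambda title: 10,
    2: lambda title: 21 if "welder" in title else 20,
}

for title in ["office assistant", "welder", "mechanic"]:
    print(title, predict_hierarchical(title, toy_parent, toy_subs))
```

The trade-off I see is that each second-stage model only has to separate the sub-classes of one parent, but a mistake at the parent stage cannot be corrected later.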
Any guidance would be greatly appreciated.