I was doing some research on how to improve my classification model (which currently uses text-embedding-ada-002), when I stumbled across the MTEB leaderboard.
For some additional context, I’m training a random forest machine learning model to solve a classification problem.
The input feature of the random forest model is a string containing the equipment's description, e.g. Equipment Description: “Area Name: PASTE PLANT, Equipment Group: CONVEYOR, Equipment Name: Conveyor Mixer”. The output is a criticality classification, e.g. low, medium, high, severe. I train the model on an existing dataset and then use it to extrapolate to other assets.
Given there is no consistency in the input features (syntax and terminology vary), I’ve found that using an LLM to convert the semantic meaning of the equipment descriptions into vector embeddings, and then training the random forest model on those embeddings as features (after PCA dimensionality reduction), works extremely well.
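To make the pipeline concrete, here is a minimal sketch of what I’m doing. The embedding step is stubbed out with random vectors standing in for text-embedding-ada-002 output (1536 dimensions); the dataset size, PCA components, and forest hyperparameters are illustrative assumptions, not my tuned values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Placeholder for embeddings of the equipment-description strings:
# 200 assets, each embedded as a 1536-dim vector (ada-002's output size).
X = rng.normal(size=(200, 1536))
y = rng.choice(["low", "medium", "high", "severe"], size=200)

clf = make_pipeline(
    PCA(n_components=50),  # reduce 1536 dims before the forest
    RandomForestClassifier(n_estimators=300, random_state=0),
)
clf.fit(X, y)

preds = clf.predict(X[:5])
print(preds)
```

In production the `X` matrix comes from embedding each description string, but the PCA + random forest stages are exactly as shown.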
Question: Which class of Hugging Face embedding models should I try instead of OpenAI’s text-embedding-ada-002? I see there are tabs for Classification and Pair Classification. Any others I should try?
Question: What would you suggest I try for the task_objective? The examples I’ve seen for the task objective are “Represent this sentence” or “Represent this document for retrieval”.
Question: Could this problem be solved by fine-tuning an LLM directly, avoiding the need for a separate machine learning model like the random forest? I did try this with GPT-3.5 fine-tuning, which was recently released, but the results were mediocre and the random forest model’s performance was better.
I’m very experienced with OpenAI’s range of models but very new to Hugging Face, and I’d love to learn more and explore what options Hugging Face has to offer.
Thanks in advance for any feedback.