I am trying to categorize a set of 10,000 new products into existing categories using a machine learning approach. I have a dataset for training and several candidate strategies, ranging from classic ML techniques to transformer-based embeddings to pre-trained language models, but I'm unsure which route to take. Can you help me decide?
I have a dataset of 1,000 products, each assigned to one of roughly 100 categories. The data for each product includes its name, description, and price. I now wish to categorize an additional 10,000 products with similar data using a robust, reliable method. Each product should be assigned to exactly one category.
In a traditional machine learning approach, I might use libraries such as spaCy or NLTK to represent each product as a bag of words, train a classifier on this representation, and then apply the classifier to the new catalog (as sketched below).
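To make that concrete, here is roughly what I have in mind, sketched with scikit-learn for the vectorizer and classifier; `products_train.csv` and `products_new.csv` are placeholders for my actual files, and I'm only using the text fields here (price could be appended as a numeric feature later):

```python
# Minimal sketch of the bag-of-words approach, assuming the data sits in
# hypothetical CSVs with name / description / price / category columns.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("products_train.csv")  # hypothetical path
# Concatenate name and description into a single text field per product.
text = train["name"] + " " + train["description"]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # bag of words + bigrams
    LogisticRegression(max_iter=1000),
)

# Cross-validate on the 1,000 labeled products before trusting it on 10,000 new ones.
scores = cross_val_score(pipeline, text, train["category"], cv=5)
print(f"5-fold accuracy: {scores.mean():.3f}")

pipeline.fit(text, train["category"])
new = pd.read_csv("products_new.csv")  # hypothetical path
predictions = pipeline.predict(new["name"] + " " + new["description"])
```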
Alternatively, within the HuggingFace ecosystem, I could use a transformer to represent each product as a dense vector (an embedding), which I could then feed into a traditional classifier. Or I could fine-tune a pre-trained model from HuggingFace directly on my labeled data.
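For the embedding route, I imagine something like the following, assuming the sentence-transformers library; `all-MiniLM-L6-v2` is just a common general-purpose model I picked as a placeholder, not a settled choice:

```python
# Sketch of the embedding approach: encode products with a sentence transformer,
# then train a classical classifier on the resulting vectors.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("products_train.csv")  # hypothetical path, as above
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# Encode each product's name + description into a dense vector.
X_train = model.encode((train["name"] + " " + train["description"]).tolist())
clf = LogisticRegression(max_iter=1000).fit(X_train, train["category"])

new = pd.read_csv("products_new.csv")  # hypothetical path
X_new = model.encode((new["name"] + " " + new["description"]).tolist())
predictions = clf.predict(X_new)
```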
Considering the above, I have the following questions:
- Which of the three approaches would you recommend trying first: classic ML on a bag of words, transformer embeddings fed into a classical classifier, or a pre-trained language model?
- If you recommend the third approach (i.e., using a pre-trained model), can you suggest specific HuggingFace models suitable for this task? (A rough sketch of what I mean is below.)
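For reference, here is the rough shape of what I mean by "using a pre-trained model", i.e., fine-tuning it end-to-end with the transformers Trainer; `distilbert-base-uncased` is only a placeholder for whatever model you might suggest:

```python
# Sketch of fine-tuning a pre-trained model for sequence classification,
# assuming the same hypothetical products_train.csv as above.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

train = pd.read_csv("products_train.csv")  # hypothetical path
labels = sorted(train["category"].unique())
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # Feed name and description as a text pair; map categories to integer ids.
    enc = tokenizer(batch["name"], batch["description"], truncation=True)
    enc["label"] = [label2id[c] for c in batch["category"]]
    return enc

dataset = Dataset.from_pandas(train).map(preprocess, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```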
Any insights or recommendations would be greatly appreciated.
Thank you!