What is the best way to classify my content into tags?

Hi everyone. I’m building a social platform for developers, and I’ve built a basic tag system (similar to Twitter’s) where users manually classify their content into tags: e.g. you write something like a tweet and you have to choose a tag. I’d like to remove this friction for users by having a model (e.g. ChatGPT via the OpenAI API) classify it for me. I have a finite list of tags (the tag system is not dynamic), and I’d like to ask the AI: given this tweet, article, or any content, tag it by choosing one of the available tags (I have a table in Postgres with all of our tags).

Now of course, I could do this in a super simple way, e.g. always include the list of a few hundred tags in the prompt, but this feels wrong and, honestly, like a massive waste of tokens.

What can I do to solve this? How would you go about it?

Thanks a lot in advance!


Broadly speaking, there are three approaches: the standard one using small models like BERT; ultra-fast non-neural-network models (which can do BERT-like classification, less accurately but much faster); and, as a last resort, generating the tags with a smaller LLM.

There is a huge amount of existing know-how on training models like BERT, so I think it would be a good idea to start with the NLP course on Hugging Face.

The following is based on information I obtained from Hugging Chat.


To solve your problem of automating content classification into a finite list of tags using Hugging Face, here are several options you can consider. Each option has its own advantages and disadvantages:


Option 1: Fine-tuning a Pre-trained Text Classification Model

  • Approach:
    • Use a pre-trained text classification model (like bert-base-uncased, roberta-base, or distilbert-base-uncased) from Hugging Face and fine-tune it on your dataset of tags.
    • Your dataset should consist of content (like tweets or articles) paired with the corresponding tags (manually assigned by users).
  • Advantages:
    • High accuracy for your specific tag list since the model is trained on your data.
    • Can handle a large number of tags effectively.
    • Models in Hugging Face are pre-trained on diverse text data, providing a strong foundation.
  • Disadvantages:
    • Requires a significant amount of labeled data to fine-tune the model effectively.
    • May require some expertise in machine learning and Hugging Face’s libraries (e.g., 🤗 Transformers).
    • Fine-tuning can be computationally intensive for large models.
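As a concrete sketch of the data-prep step this option needs (the `rows` and `all_tags` below are hypothetical stand-ins for the content/tag pairs and tag table already in your Postgres database):

```python
# Hypothetical stand-in for (content, tag) rows pulled from your Postgres tables.
rows = [
    ("Shipped a new CLI in Rust", "rust"),
    ("Tips for writing async Python", "python"),
]
all_tags = ["python", "rust", "devops"]  # the finite tag list

# Sequence-classification models train on integer labels, so build the
# label2id / id2label maps once and reuse them at training and inference time.
label2id = {tag: i for i, tag in enumerate(all_tags)}
id2label = {i: tag for tag, i in label2id.items()}

# Records in the {"text": ..., "label": ...} shape most text-classification
# training setups expect.
dataset = [{"text": text, "label": label2id[tag]} for text, tag in rows]
print(dataset[0])  # {'text': 'Shipped a new CLI in Rust', 'label': 1}
```

The maps can then be passed to the model head, e.g. `AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(all_tags), id2label=id2label, label2id=label2id)`, so predictions come back as tag names rather than bare indices.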

Option 2: Using Zero-shot Text Classification

  • Approach:
    • Use a zero-shot text classification pipeline with a pre-trained NLI model (e.g., facebook/bart-large-mnli).
    • Provide your finite list of tags as labels to the model. The model will use these tags to classify the input text without requiring explicit training on your tag list.
    • This method is efficient since you don’t need to fine-tune the model, and you can dynamically pass your tag list during inference.
  • Advantages:
    • No need for labeled data or fine-tuning.
    • Flexibility to change the tag list without retraining the model.
    • Reduces the number of tokens used compared to prompting a model like ChatGPT with the entire tag list every time.
  • Disadvantages:
    • Accuracy might be lower than fine-tuning since the model hasn’t been explicitly trained on your tag list.
    • May require fine-tuning or additional adjustments if the tags are highly domain-specific or ambiguous.
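One practical caveat with a few hundred tags: NLI-based zero-shot models score each candidate label with a separate forward pass, so a single call over the whole tag list gets slow. A minimal sketch of scoring the tags in chunks and keeping the best one — the scorer is pluggable; `keyword_scorer` is a toy stand-in, and the commented lines show how a real transformers pipeline could slot in (with `multi_label=True`, scores are independent per label, so they stay comparable across chunks):

```python
from typing import Callable, Sequence

# score_fn(text, labels) -> list of (label, score) pairs
ScoreFn = Callable[[str, Sequence[str]], list]

def best_tag(text: str, tags: Sequence[str], score_fn: ScoreFn, chunk_size: int = 20) -> str:
    """Score candidate tags in chunks and return the single highest-scoring tag."""
    best, best_score = tags[0], float("-inf")
    for i in range(0, len(tags), chunk_size):
        for label, score in score_fn(text, tags[i:i + chunk_size]):
            if score > best_score:
                best, best_score = label, score
    return best

# With transformers installed, the scorer could wrap the zero-shot pipeline:
#   from transformers import pipeline
#   clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
#   def hf_scorer(text, labels):
#       out = clf(text, candidate_labels=list(labels), multi_label=True)
#       return list(zip(out["labels"], out["scores"]))

# Toy scorer for illustration only: favors tags whose name appears in the text.
def keyword_scorer(text, labels):
    return [(label, 1.0 if label in text.lower() else 0.0) for label in labels]

print(best_tag("New release of our Rust web framework",
               ["python", "rust", "devops"], keyword_scorer))
# rust
```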

Option 3: Using Hugging Face’s Pre-trained Sentiment Analysis Models

  • Approach:
    • Start from a model already adapted to a similar domain (e.g., finiteautomata/bertweet-base-sentiment-analysis, which is built on a tweet-trained backbone) and repurpose it for your tag classification task.
    • Sentiment analysis models ship with a binary or multi-class sentiment head; you would replace that head with one over your tag list and fine-tune it.
  • Advantages:
    • Pre-trained models are readily available and can handle text classification tasks efficiently.
    • Quick setup, since the backbone is already adapted to short, informal text like tweets.
  • Disadvantages:
    • Sentiment analysis models may not be well-suited for your tag list if your tags are not sentiment-based.
    • May require additional customization or fine-tuning to improve accuracy for your specific tags.

Option 4: Using AutoTrain for Automated Model Training

  • Approach:
    • Use AutoTrain (formerly AutoNLP), Hugging Face’s tool for automatically training, evaluating, and deploying a text classification model for your tag list.
    • AutoTrain requires minimal to no coding and can handle the fine-tuning process for you.
  • Advantages:
    • No need for machine learning expertise.
    • Automated model selection, training, and evaluation.
    • Easy deployment of the trained model.
  • Disadvantages:
    • May not provide as much control over the model as fine-tuning manually.
    • Limited customization options compared to using the 🤗 Transformers library directly.

Option 5: Using Pre-trained Taggers or Custom Pipelines

  • Approach:
    • Use Hugging Face’s classification pipelines: the text-classification pipeline runs a fine-tuned model with its own trained labels, while the zero-shot-classification pipeline lets you pass your tag list as candidate labels at inference time.
    • Hugging Face’s pipelines handle text classification tasks efficiently with only a few lines of code.
  • Advantages:
    • Easy integration and deployment using Hugging Face’s pipelines.
    • Flexibility in specifying your tag list during inference.
    • Pre-trained models are optimized for text classification tasks, providing good accuracy out of the box.
  • Disadvantages:
    • May require some tuning or adjustments if your tags are highly domain-specific.
    • Not as flexible as fine-tuning for your specific use case.
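Whichever pipeline produces the scores, one small pattern worth considering (a sketch, not something from the original post) is a confidence threshold: auto-apply the top tag only when the model is reasonably sure, and fall back to the manual picker you already have otherwise:

```python
def pick_tag(scored, threshold=0.5):
    """Return the top tag if its score clears the threshold, else None.

    `scored` is a list of (tag, score) pairs, e.g. zipped from a pipeline's
    "labels" and "scores" outputs. None means: keep the manual tag picker.
    """
    label, score = max(scored, key=lambda pair: pair[1])
    return label if score >= threshold else None

print(pick_tag([("python", 0.82), ("rust", 0.11)]))  # python
print(pick_tag([("python", 0.34), ("rust", 0.33)]))  # None
```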

Recommendation

If you want simplicity and efficiency, Option 2 (Zero-shot Text Classification) is likely the best approach. It allows you to dynamically pass your tag list during inference without requiring fine-tuning or labeled data. If accuracy is a priority and you have access to labeled data, Option 1 (Fine-tuning) would be the better choice.

Let me know if you’d like further clarification or assistance!
