Please read the topic category description to understand what this is all about
Description
Twitter classifies trending tweets according to a predefined set of topics like “Data science”, “Hip hop”, “Sport” etc. However, their algorithm often appears to get confused by certain keywords in the tweet or by the content of the image (see here for some funny examples).
The goal of this project is to explore whether it’s possible to create a better topic extractor, or at least one that is more targeted at a smaller set of domains. There are several ways to approach the task:
- Frame it as a multiclass classification problem
- Frame it as an unsupervised clustering problem, and combine this with techniques like UMAP and/or HDBSCAN
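To make the first framing concrete, here is a minimal sketch of topic classification as nearest-centroid matching over embedding vectors. It is only an illustration under stated assumptions: the function names (`cosine`, `fit_centroids`, `predict_topic`) are hypothetical, the toy 2-D vectors stand in for real sentence embeddings, and plain Python lists are used just to keep the example dependency-free (NumPy or scikit-learn would be the more usual choice).

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(embeddings, labels):
    """Average the embeddings of each topic into one centroid per topic."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict_topic(embedding, centroids):
    """Return the topic whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))


# Toy data (assumption): 2-D vectors in place of real tweet embeddings.
train_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["Data science", "Data science", "Hip hop", "Hip hop"]
centroids = fit_centroids(train_vecs, train_labels)
print(predict_topic([0.8, 0.3], centroids))
```

A trained classifier head on top of the embeddings would likely do better; the centroid version is shown because it makes the "one label per tweet" framing easy to see.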
Model(s)
Since tweets are short, one of the sentence-transformers models on the Hub is likely a good place to start.
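As a rough sketch of what that starting point could look like, the snippet below cleans tweets lightly and encodes them with a sentence-transformers model. The checkpoint `sentence-transformers/all-MiniLM-L6-v2` is just one possible choice, not one named by this project, and the helper names `clean_tweet` and `embed_tweets` are hypothetical. Whether stripping URLs and @mentions actually helps is an assumption worth testing on your own data.

```python
import re

_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\w+")


def clean_tweet(text):
    """Drop URLs and @mentions, then collapse whitespace.

    Hashtags are kept because they often carry topic information.
    """
    text = _URL.sub("", text)
    text = _MENTION.sub("", text)
    return " ".join(text.split())


def embed_tweets(tweets, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode tweets into unit-length embedding vectors.

    The import is done lazily so the cleaning helper can be used
    without the sentence-transformers package installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    cleaned = [clean_tweet(t) for t in tweets]
    return model.encode(cleaned, normalize_embeddings=True)


# Example usage (downloads the model on first call):
# vectors = embed_tweets(["Loving the new pandas release! https://t.co/xyz"])
```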
Datasets
There are various Twitter datasets on the Hub, and here are a few examples to start with:
Challenges
If you take the unsupervised learning approach, be warned that there’s no “correct” answer and you will have to experiment with the dimensionality reduction and clustering algorithms to get meaningful clusters.
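One way to structure that experimentation is sketched below: reduce the embeddings with UMAP, cluster with HDBSCAN, and summarise each run so that different settings can be compared. The helper names (`cluster_tweets`, `summarise_clusters`) and the default parameter values are assumptions chosen for illustration, not recommended settings. The summary relies on HDBSCAN's convention of labelling noise points as `-1`.

```python
from collections import Counter


def cluster_tweets(embeddings, n_neighbors=15, n_components=5,
                   min_cluster_size=10, seed=42):
    """Reduce embeddings with UMAP, then cluster them with HDBSCAN.

    Requires the umap-learn and hdbscan packages; they are imported
    lazily so the summary helper below works without them.
    """
    import hdbscan
    import umap

    reduced = umap.UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        metric="cosine",
        random_state=seed,
    ).fit_transform(embeddings)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)


def summarise_clusters(labels):
    """Report cluster count, noise fraction, and largest cluster size."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)  # HDBSCAN marks unclustered points as -1
    total = noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / total if total else 0.0,
        "largest_cluster": max(counts.values(), default=0),
    }


# Example sweep (assumes `vectors` holds your tweet embeddings):
# for size in (5, 10, 25, 50):
#     labels = cluster_tweets(vectors, min_cluster_size=size)
#     print(size, summarise_clusters(labels))
```

These numbers only describe the shape of a clustering. Judging whether the clusters are meaningful still means reading sample tweets from each one.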
Desired project outcomes
- Create a Streamlit or Gradio app on Spaces that can either automatically classify a tweet according to a topic, or visualise the 2D projection of the embeddings and colour them by topic.
- Don’t forget to push all your models and datasets to the Hub so others can build on them!
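For the classification variant of the app, a Gradio skeleton might look roughly like the following. Everything here is a hedged sketch: `build_demo` and `placeholder_classifier` are hypothetical names, the three topics are simply the examples from the description, and the placeholder returns uniform scores until you swap in a real model.

```python
def placeholder_classifier(tweet):
    """Stand-in for a real model: gives every topic the same score."""
    topics = ["Data science", "Hip hop", "Sport"]
    return {topic: 1.0 / len(topics) for topic in topics}


def build_demo(classify_fn):
    """Wrap a tweet -> {topic: score} function in a Gradio interface."""
    import gradio as gr  # imported lazily; requires the gradio package

    return gr.Interface(
        fn=classify_fn,
        inputs=gr.Textbox(lines=3, label="Tweet"),
        outputs=gr.Label(num_top_classes=3, label="Topic"),
        title="Twitter topic extractor",
    )


# To run locally or on Spaces:
# build_demo(placeholder_classifier).launch()
```

Keeping the classifier as a plain function that returns a dictionary of scores makes it easy to test on its own and to replace later.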
Additional resources
- https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
- https://huggingface.co/spaces/edugp/embedding-lenses (useful for UMAP inspiration)
Discord channel
To chat and organise with other people interested in this project, head over to our Discord and:
- Follow the instructions on the #join-course channel
- Join the #twitter-topic-extractor channel
Just make sure you comment here to indicate that you’ll be contributing to this project