Build a Twitter topic extractor

lewtun · November 10, 2021, 4:45pm

Please read the topic category description to understand what this is all about

Description

Twitter classifies trending tweets according to a predefined set of topics like “Data science”, “Hip hop”, “Sport”, etc. However, its algorithm often appears to get confused by certain keywords in the tweet or by the content of the image (see here for some funny examples).

The goal of this project is to explore whether it’s possible to create a better topic extractor, or at least one that is more targeted for a smaller set of domains. There are several ways to approach the task:

Frame it as a multiclass classification problem
Frame it as an unsupervised clustering problem, and combine this with techniques like UMAP and/or HDBSCAN

Model(s)

Since Tweets are short, picking one of the sentence-transformers models on the Hub is likely a good place to start.
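
As a rough illustration of the classification framing on top of sentence embeddings, here is a minimal sketch of a nearest-centroid topic classifier. The class `NearestCentroidTopicClassifier`, the helper names, and the model name `all-MiniLM-L6-v2` are assumptions made for this example, not something prescribed by the project. `toy_embedder` is only a hashed bag-of-words stand-in so the sketch runs offline; in practice you would pass the result of `load_sentence_encoder()` instead.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[Sequence[str]], List[Vector]]


def load_sentence_encoder(model_name: str = "all-MiniLM-L6-v2") -> Embedder:
    """Return an embedder backed by sentence-transformers.

    Assumption: the package is installed and `model_name` is a model you
    picked from the Hub. The import is lazy so the rest runs without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts)).tolist()


def toy_embedder(texts: Sequence[str], dim: int = 64) -> List[Vector]:
    """Stand-in embedder for offline experiments: hashed bag of words, NOT a real model."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # Deterministic bucket per token (Python's built-in hash() is salted).
            vec[sum(map(ord, token)) % dim] += 1.0
        vectors.append(vec)
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NearestCentroidTopicClassifier:
    """Assign each tweet to the topic whose mean embedding is most similar."""

    def __init__(self, embed: Embedder):
        self.embed = embed
        self.centroids: Dict[str, Vector] = {}

    def fit(self, tweets: Sequence[str], topics: Sequence[str]):
        grouped = defaultdict(list)
        for vec, topic in zip(self.embed(tweets), topics):
            grouped[topic].append(vec)
        # Average the embeddings of each topic, dimension by dimension.
        self.centroids = {
            topic: [sum(col) / len(vecs) for col in zip(*vecs)]
            for topic, vecs in grouped.items()
        }
        return self

    def predict(self, tweets: Sequence[str]) -> List[str]:
        return [
            max(self.centroids, key=lambda t: cosine(vec, self.centroids[t]))
            for vec in self.embed(tweets)
        ]
```

With a handful of labelled tweets per topic, `NearestCentroidTopicClassifier(embed).fit(tweets, topics).predict(new_tweets)` gives a quick baseline to compare against before fine-tuning anything.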

Datasets

There are various Twitter datasets on the Hub, and here are a few examples to start with:

Challenges

If you take the unsupervised learning approach, be warned that there’s no “correct” answer and you will have to experiment with the dimensionality reduction / clustering algorithms to get meaningful clusters.
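
One way to organise that experimentation is a small parameter sweep that reports, for each setting, how many clusters came out and how many points were left as noise. The sketch below assumes the `umap-learn` and `hdbscan` packages are installed; the function names, the grids, and the choice of 5 UMAP components are illustrative assumptions, not recommended values.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple


def summarise_labels(labels: Sequence[int]) -> Dict[str, float]:
    """Summarise one clustering; HDBSCAN marks noise points with the label -1."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / len(labels) if len(labels) else 0.0,
        "largest_cluster": max(counts.values(), default=0),
    }


def sweep(
    embeddings,
    n_neighbors_grid: Sequence[int] = (5, 15, 50),
    min_cluster_size_grid: Sequence[int] = (5, 15, 30),
) -> List[Tuple[Tuple[int, int], Dict[str, float]]]:
    """Run UMAP followed by HDBSCAN for every parameter pair and summarise each run.

    `embeddings` is assumed to be a 2D array of tweet embeddings.
    """
    # Lazy imports so summarise_labels can be used without these packages.
    import hdbscan
    import umap

    results = []
    for n_neighbors, min_cluster_size in product(n_neighbors_grid, min_cluster_size_grid):
        reduced = umap.UMAP(
            n_neighbors=n_neighbors,
            n_components=5,  # illustrative choice; use 2 for plotting
            metric="cosine",
            random_state=42,
        ).fit_transform(embeddings)
        labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
        results.append(((n_neighbors, min_cluster_size), summarise_labels(labels.tolist())))
    return results
```

The summary numbers are only a first filter: a setting with a sensible cluster count and low noise still needs a manual look at sample tweets per cluster to judge whether the topics are meaningful.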

Desired project outcomes

Create a Streamlit or Gradio app on Spaces that can either automatically classify a Tweet according to a topic, or visualise the 2D projection of the embeddings and colour them by topic.
Don’t forget to push all your models and datasets to the Hub so others can build on them!
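
For the classification variant of the app, a Gradio sketch could look like the following. `placeholder_topic_scores`, `TOPIC_KEYWORDS`, and `build_demo` are hypothetical names for this example, and the keyword matcher is only a stand-in for a trained model so that the interface can be wired up first. The sketch assumes the `gradio` package is installed.

```python
from typing import Callable, Dict

# Hypothetical keyword lists, used only so the demo has something to show.
TOPIC_KEYWORDS = {
    "Data science": {"data", "model", "pandas"},
    "Hip hop": {"rap", "album", "beat"},
    "Sport": {"match", "goal", "team"},
}


def placeholder_topic_scores(tweet: str) -> Dict[str, float]:
    """Stand-in for a real model: turn keyword hits into scores that sum to 1."""
    tokens = set(tweet.lower().split())
    hits = {topic: len(tokens & words) for topic, words in TOPIC_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        # No keyword matched: fall back to a uniform distribution.
        return {topic: 1.0 / len(hits) for topic in hits}
    return {topic: n / total for topic, n in hits.items()}


def build_demo(predict: Callable[[str], Dict[str, float]]):
    """Wrap any tweet -> {topic: score} function in a Gradio interface."""
    import gradio as gr  # lazy import so the scoring code runs without gradio

    return gr.Interface(
        fn=predict,
        inputs=gr.Textbox(lines=3, label="Tweet"),
        outputs=gr.Label(num_top_classes=3),
        title="Twitter topic extractor",
    )
```

Calling `build_demo(placeholder_topic_scores).launch()` starts the app locally; swapping in your own prediction function is the only change needed once a model is trained.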

Additional resources

https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
https://huggingface.co/spaces/edugp/embedding-lenses (useful for UMAP inspiration)

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

Follow the instructions on the #join-course channel
Join the #twitter-topic-extractor channel

Just make sure you comment here to indicate that you’ll be contributing to this project

wilmerags · November 15, 2021, 8:11pm

I’d like to work on this project!

rg089 · November 16, 2021, 9:53pm

I’d like to join this project!

alghar · November 17, 2021, 12:27am

I am interested as well. What is your timeframe?

wilmerags · November 17, 2021, 1:27am

I think the project needs to be available for review on November 24th.

alghar · November 17, 2021, 5:29am

Hello Wilmer,

Would you like to have a zoom chat tomorrow (after the class)?

I am in Calif (-3 hours from you)
Al

wilmerags · November 19, 2021, 2:53am

Oh sorry, I didn’t see this. Yes, let’s catch up on Discord.

anupama4821 · March 7, 2023, 1:43pm

Interested
