Build a Twitter topic extractor

:wave: Please read the topic category description to understand what this is all about

Description

Twitter classifies trending tweets according to a predefined set of topics like “Data science”, “Hip hop”, “Sport”, etc. However, its algorithm often appears to get confused by certain keywords in the tweet or by the content of the image (see here for some funny examples).

The goal of this project is to explore whether it’s possible to create a better topic extractor, or at least one that is more targeted at a smaller set of domains. There are several ways to approach the task:

  • Frame it as a multiclass classification problem
  • Frame it as an unsupervised clustering problem, and combine this with techniques like UMAP and/or HDBSCAN
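
To make the multiclass framing concrete, here is a minimal sketch of a nearest-centroid classifier. It is only a baseline under stated assumptions: it uses bag-of-words counts from the standard library instead of learned embeddings, and the six training tweets are invented toy examples, not real data. The function names (`tokenize`, `fit_centroids`, `predict`) are hypothetical helpers for illustration.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_centroids(tweets, labels):
    """Sum the bag-of-words counts per topic, giving one centroid per class."""
    centroids = defaultdict(Counter)
    for text, label in zip(tweets, labels):
        centroids[label].update(tokenize(text))
    return dict(centroids)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict(text, centroids):
    """Assign the topic whose centroid is most similar to the tweet."""
    vec = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented toy examples, only to show the shape of the problem.
train_tweets = [
    "Training a neural network on tabular data",
    "New pandas release speeds up data frames",
    "The striker scored twice in the final match",
    "Great match, the team won the league",
    "This rapper dropped a new album",
    "Best hip hop beats of the year",
]
train_labels = [
    "Data science", "Data science",
    "Sport", "Sport",
    "Hip hop", "Hip hop",
]

centroids = fit_centroids(train_tweets, train_labels)
print(predict("The team scored in the last match", centroids))
```

In a real attempt you would likely swap the word counts for sentence embeddings and the centroid rule for a trained classifier, but the input/output contract (tweet in, one topic out) stays the same.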

Model(s)

Since tweets are short, picking one of the sentence-transformers models on the Hub is likely a good place to start.
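
As a rough sketch of what that could look like: the snippet below encodes tweets with the `sentence-transformers` library and ranks them by cosine similarity. The checkpoint name `all-MiniLM-L6-v2` is an assumption (the post does not name a model), and `embed_tweets` / `most_similar` are hypothetical helper names. The library is imported lazily, so nothing is downloaded until you call `embed_tweets`.

```python
import math


def embed_tweets(texts, model_name="all-MiniLM-L6-v2"):
    """Encode tweets into unit-length sentence embeddings.

    Needs `pip install sentence-transformers`. The default model name is an
    assumption; any sentence-transformers checkpoint from the Hub should work.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True returns unit vectors, so dot product == cosine
    return model.encode(texts, normalize_embeddings=True).tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vec, vectors, k=3):
    """Return the indices of the k vectors closest to query_vec, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query_vec, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

Typical usage would be `vectors = embed_tweets(tweets)` followed by `most_similar(vectors[0], vectors)` to eyeball whether tweets on the same topic end up close together before building anything on top.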

Datasets

There are various Twitter datasets on the Hub, and here are a few examples to start with:

Challenges

If you take the unsupervised learning approach, be warned that there’s no “correct” answer, and you will have to experiment with the dimensionality reduction and clustering algorithms to get meaningful clusters.
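
One way to structure that experimentation is a small parameter sweep. The sketch below assumes the `umap-learn` and `hdbscan` packages and tweet embeddings you have already computed; the grid values, the 5-dimensional intermediate projection, and the helper names `summarise_clusters` and `sweep` are all illustrative choices, not recommendations from this post.

```python
from collections import Counter
from itertools import product


def summarise_clusters(labels):
    """Count clusters and the share of points HDBSCAN marked as noise (-1)."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # HDBSCAN uses the label -1 for noise points
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / len(labels) if len(labels) else 0.0,
        "largest_cluster": max(counts.values(), default=0),
    }


def sweep(embeddings,
          n_neighbors_grid=(5, 15, 50),
          min_cluster_size_grid=(5, 15, 30)):
    """Run UMAP + HDBSCAN for each setting and summarise the clustering.

    Needs `pip install umap-learn hdbscan`. Grid values are illustrative and
    n_neighbors should stay below the number of tweets being embedded.
    """
    import hdbscan  # lazy imports so the helper above works without them
    import umap

    results = []
    for n_neighbors, min_cluster_size in product(
        n_neighbors_grid, min_cluster_size_grid
    ):
        reduced = umap.UMAP(
            n_neighbors=n_neighbors,
            n_components=5,
            metric="cosine",
            random_state=42,
        ).fit_transform(embeddings)
        labels = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size
        ).fit_predict(reduced)
        results.append({
            "n_neighbors": n_neighbors,
            "min_cluster_size": min_cluster_size,
            **summarise_clusters(labels.tolist()),
        })
    return results
```

The summary numbers are only a first filter: a setting with one giant cluster or a very high noise fraction is probably not useful, but the remaining candidates still need a manual look at the tweets in each cluster.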

Desired project outcomes

  • Create a Streamlit or Gradio app on :hugs: Spaces that can either automatically classify a tweet according to a topic or visualise the 2D projection of the embeddings, coloured by topic.
  • Don’t forget to push all your models and datasets to the Hub so others can build on them!
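
For the classification variant of the app, a Gradio skeleton could look roughly like this. The keyword lookup in `classify_tweet` is a hypothetical placeholder standing in for whatever model you train, and the keyword lists are invented; only the wiring (a text box in, a label with scores out) is the point of the sketch.

```python
# Hypothetical keyword lists: a stand-in for a real trained model.
TOPIC_KEYWORDS = {
    "Data science": {"data", "model", "pandas", "neural"},
    "Hip hop": {"rapper", "album", "beats", "hop"},
    "Sport": {"match", "team", "goal", "league"},
}


def classify_tweet(text):
    """Return a {topic: score} dict whose scores sum to 1."""
    words = set(text.lower().split())
    hits = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        # No keyword matched: fall back to a uniform distribution.
        return {topic: 1 / len(hits) for topic in hits}
    return {topic: n / total for topic, n in hits.items()}


def build_demo():
    """Wire the classifier into a Gradio interface (needs `pip install gradio`)."""
    import gradio as gr  # lazy import so classify_tweet is usable on its own

    return gr.Interface(
        fn=classify_tweet,
        inputs=gr.Textbox(lines=3, label="Tweet"),
        outputs=gr.Label(num_top_classes=3, label="Topic"),
        title="Twitter topic extractor",
    )


# In the Space's app.py you would typically end with:
# build_demo().launch()
```

Returning a dict of scores lets the `Label` component show a ranked list of topics instead of a single answer, which is handy for spotting tweets the model is unsure about.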

Additional resources

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

  • Follow the instructions on the #join-course channel

  • Join the #twitter-topic-extractor channel

Just make sure you comment here to indicate that you’ll be contributing to this project :slight_smile:

I’d like to work on this project! :rocket:

I’d like to join this project!

I am interested as well. What is your timeframe?

I think that the project needs to be available for review on November 24th.

Hello Wilmer,

Would you like to have a zoom chat tomorrow (after the class)?

I am in Calif (-3 hours from you)
Al

Oh sorry, I didn’t see this. Yes, let’s catch up on Discord :wink: