Classifying tweets by theme: How do I start?

I have a dataset of 7 million tweets and would like to know whether they talk about these topics: Economy, Social, Environmental.

A tweet can talk about one topic, two topics, all topics, or none of them.

Can someone give me a starting point? What model should I use? Should I label some data, or is there another NLP approach that works better than transformers?

Hi, I would use BERT to start off with. Do you have any labeled data? Here is a good starting point.
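Since a tweet can cover several topics or none, this is a multi-label problem rather than a single-label one: each topic gets its own yes/no output (an independent sigmoid) instead of one softmax over all topics. Below is a minimal sketch of that target encoding and decoding in plain Python. The helper names, the example logits, and the 0.5 threshold are illustrative assumptions, not part of any library. If you fine-tune with Hugging Face transformers, I believe setting `problem_type="multi_label_classification"` on a sequence classification model gives you the matching loss, but check the docs for your version.

```python
import math

TOPICS = ["Economy", "Social", "Environmental"]

def encode_labels(topics):
    """Turn a set of topic names into a multi-hot vector in TOPICS order."""
    return [1 if t in topics else 0 for t in TOPICS]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_logits(logits, threshold=0.5):
    """Apply an independent sigmoid per topic and keep those at or above the threshold."""
    return [t for t, z in zip(TOPICS, logits) if sigmoid(z) >= threshold]

# A tweet about two topics, and a tweet about none.
print(encode_labels({"Economy", "Environmental"}))  # [1, 0, 1]
print(encode_labels(set()))                         # [0, 0, 0]

# Hypothetical model outputs (logits), one per topic.
print(decode_logits([2.0, -1.5, 0.3]))    # ['Economy', 'Environmental']
print(decode_logits([-3.0, -2.0, -1.0]))  # []
```

The all-zeros vector is how "none of them" is represented, so you do not need a fourth "other" class.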

I don’t have any labelled data yet, but I can label some of it. What is a reasonable number of labelled examples in this case?

I would say at least 1k, but ideally 5k. Oof, that's a lot to hand-label. Maybe you can outsource it or use AWS Ground Truth.

Yes, BERT with PyTorch is a good starting point. Ideally you want labelled data and balanced classes; unbalanced classes can be a bit tricky. Good luck!
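One common way to handle unbalanced classes in a multi-label setup is to weight the positive examples of each topic by the ratio of negatives to positives. A rough sketch of that calculation is below; the function name and the tiny label matrix are made up for illustration. In PyTorch, weights like these can be passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.

```python
def positive_weights(label_rows):
    """Return the negatives/positives ratio for each label column.

    Rare labels get a weight above 1, common labels a weight below 1.
    Columns with no positives fall back to 1.0 to avoid dividing by zero.
    """
    n_rows = len(label_rows)
    n_labels = len(label_rows[0])
    weights = []
    for j in range(n_labels):
        pos = sum(row[j] for row in label_rows)
        neg = n_rows - pos
        weights.append(neg / pos if pos else 1.0)
    return weights

# Toy multi-hot labels in the order [Economy, Social, Environmental].
labels = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
print(positive_weights(labels))  # [0.5, 5.0, 5.0]
```

Whether this helps depends on how skewed your labelled sample turns out to be, so it is worth checking the per-topic counts before training.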

Or use a seeded topic model, which only requires drafting a list of keywords for each topic. See BERTopic here: Guided Topic Modeling - BERTopic
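To give a feel for what "a list of keywords for each topic" looks like, here is a deliberately simple keyword-matching baseline in plain Python. This is not BERTopic and not a topic model; it only shows the seed lists and how a tweet can match several topics or none. The keywords themselves are invented placeholders. As far as I know, BERTopic accepts a comparable list of keyword lists through its `seed_topic_list` argument.

```python
import re

# Hypothetical seed keywords; replace them with terms from your own domain.
SEED_KEYWORDS = {
    "Economy": {"inflation", "jobs", "taxes", "gdp"},
    "Social": {"education", "healthcare", "inequality", "housing"},
    "Environmental": {"climate", "emissions", "pollution", "recycling"},
}

def tag_tweet(text):
    """Return every topic whose seed keywords overlap the tweet's words."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [topic for topic, words in SEED_KEYWORDS.items() if tokens & words]

print(tag_tweet("Inflation is eating into jobs and housing costs"))  # ['Economy', 'Social']
print(tag_tweet("Great match last night!"))                          # []
```

A baseline like this can also be used to pre-select candidate tweets for hand-labelling, so that the labelled sample is not dominated by off-topic tweets.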
