Classifying tweets by theme: How do I start?

thelara · March 2, 2022, 7:52pm

I have a dataset of 7 million tweets and would like to know whether they talk about these topics: Economy, Social, Environmental.

A tweet can talk about one topic, two topics, all topics, or none of them.

Can someone give me a starting point? What model should I use? Should I label some data, or is there another NLP approach that works better than transformers?

anwarika · March 2, 2022, 7:55pm

Hi, I would start with BERT. Do you have any labeled data? There is a good starting point here

thelara · March 2, 2022, 8:21pm

I don’t have any labelled data yet, but I can label some. What is a reasonable number of labelled examples in this case?

anwarika · March 2, 2022, 8:29pm

I would say at least 1k, but ideally 5k. Oof, that's a lot to hand label. Maybe you can outsource it or use AWS Ground Truth.
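When picking the 1k to 5k tweets to label, a uniform random draw avoids labeling only the first rows of the file. Here is a minimal sketch using reservoir sampling, assuming the tweets can be streamed one at a time. The function name `sample_for_labeling` and the fixed seed are illustrative choices, not something from this thread.

```python
import random
from typing import Iterable, List


def sample_for_labeling(tweets: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return k tweets chosen uniformly at random from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, tweet in enumerate(tweets):
        if i < k:
            # Fill the reservoir with the first k tweets.
            reservoir.append(tweet)
        else:
            # Keep the new tweet with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = tweet
    return reservoir


# Usage with a stand-in stream; a real run would iterate over the tweet file.
stream = (f"tweet {n}" for n in range(10_000))
to_label = sample_for_labeling(stream, k=50)
print(len(to_label))
```

Because the stream is read once and only `k` items are held in memory, this should scale to the full 7 million tweets.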

haris525 · March 2, 2022, 9:16pm

Yes, BERT with PyTorch is a good starting point. Ideally you want labelled data, and watch out for unbalanced classes (this can be a bit tricky) - good luck!
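Since a tweet can belong to several topics or none, the targets are usually encoded as one 0/1 flag per topic (multi-label), and class imbalance can be countered by weighting positives. Below is a dependency-free sketch under those assumptions. The topic names come from the question, while the helper names and the five example rows are hypothetical. The resulting ratios are the kind of values one could pass as `pos_weight` to `torch.nn.BCEWithLogitsLoss` when fine-tuning.

```python
from typing import Dict, List

TOPICS = ["economy", "social", "environmental"]


def to_multi_hot(tags: List[str]) -> List[int]:
    """Encode a tweet's topic tags as one 0/1 flag per topic."""
    return [1 if topic in tags else 0 for topic in TOPICS]


def positive_weights(labels: List[List[int]]) -> Dict[str, float]:
    """Per-topic weight = negatives / positives, so rare topics count more."""
    weights = {}
    for col, topic in enumerate(TOPICS):
        positives = sum(row[col] for row in labels)
        negatives = len(labels) - positives
        # Fall back to 1.0 when a topic has no positive examples yet.
        weights[topic] = negatives / positives if positives else 1.0
    return weights


# Hypothetical hand-labelled tags for five tweets; [] means "none of the topics".
tagged = [["economy"], ["economy", "social"], [], ["economy"], ["environmental"]]
labels = [to_multi_hot(t) for t in tagged]
print(labels[1])  # [1, 1, 0]
print(positive_weights(labels))
```

In this toy sample, "social" and "environmental" each get a weight of 4.0 because they appear in only one of five tweets, while the more common "economy" gets a weight below 1.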

ck37 · March 7, 2022, 12:23pm

Or use a seeded topic model, which just requires drafting a list of keywords for each topic. See BERTopic here: Guided Topic Modeling - BERTopic
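To make the keyword-list idea concrete, here is a minimal sketch of drafting seed words per topic and using them for plain keyword matching. This is a simpler baseline than BERTopic, not BERTopic itself, and the keyword choices are hypothetical. In BERTopic's guided mode, a list of lists like `list(SEED_KEYWORDS.values())` is what gets passed through the `seed_topic_list` parameter.

```python
import re
from typing import Dict, List

# Hypothetical seed keywords; a real list would need tuning on the actual tweets.
SEED_KEYWORDS: Dict[str, List[str]] = {
    "economy": ["inflation", "jobs", "tax", "market"],
    "social": ["education", "health", "housing", "inequality"],
    "environmental": ["climate", "emissions", "pollution", "recycling"],
}


def keyword_topics(tweet: str) -> List[str]:
    """Return every topic with at least one seed keyword in the tweet."""
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return [topic for topic, words in SEED_KEYWORDS.items() if tokens & set(words)]


print(keyword_topics("New tax on emissions could hit the housing market"))
# ['economy', 'social', 'environmental']
print(keyword_topics("Good morning everyone"))
# []
```

Exact keyword matching misses synonyms and inflected forms, so its output is better treated as noisy weak labels or as a sanity check than as a final classifier.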
