Hi,
I want to fine-tune the language model of BERT/DistilBERT (and later add a sequence classification head; the task is a kind of sentiment analysis). My data is a mix of tweets, other social media posts, and speeches.
Now I’m wondering which preprocessing steps are necessary.
I was thinking about removing URLs, hashtags, and user mentions. Is this necessary? Or should I replace them with a special token?
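
To make the two options concrete, here is a rough sketch of what I have in mind, using only Python's `re` module. The function name `normalize_post`, the placeholder strings `[URL]` and `[USER]`, and the choice to keep the hashtag word while dropping the `#` are all my own assumptions, not something BERT requires. The regexes are deliberately simple and will miss edge cases.

```python
import re

# Simple patterns; assumptions for illustration, not an exhaustive grammar.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")    # lookbehind avoids matching e-mail addresses
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")  # group 1 is the hashtag word without '#'


def normalize_post(text, replace=True):
    """Replace (or remove) URLs, user mentions, and hashtags in one post.

    replace=True : URLs -> "[URL]", mentions -> "[USER]", "#word" -> "word"
    replace=False: all three are dropped entirely
    """
    text = URL_RE.sub("[URL]" if replace else "", text)
    text = MENTION_RE.sub("[USER]" if replace else "", text)
    text = HASHTAG_RE.sub(r"\1" if replace else "", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


example = "@max check https://example.com #great day"
replaced = normalize_post(example, replace=True)
removed = normalize_post(example, replace=False)
print(replaced)
print(removed)
```

If I go with the replacement variant, my understanding is that the placeholders would also have to be registered with the tokenizer (in Hugging Face Transformers, `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`), otherwise they get split into word pieces. Please correct me if that is wrong.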
Thanks,
Max