Preprocessing step for fine-tuning language model


I want to fine-tune the language model of BERT/DistilBERT (and later add a sequence classification head; the task is a kind of sentiment analysis). My data is a mix of tweets, other social media posts, and speeches.

Now, I’m wondering: which preprocessing steps are necessary?

I was thinking about removing URLs, hashtags, and user mentions. Is this necessary? Or should I replace them with a special token?
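For what it’s worth, the “replace with a special token” option is easy to sketch with regexes. This is just a minimal illustration, not a recommendation — the placeholder strings `[URL]` and `[USER]` are arbitrary names I made up, and hashtags are stripped to their bare word rather than replaced:

```python
import re

# Hypothetical normalizer: swap URLs and user mentions for placeholder
# tokens, and keep the word behind a hashtag while dropping the '#'.
def normalize_post(text):
    text = re.sub(r"https?://\S+", "[URL]", text)   # URLs -> placeholder
    text = re.sub(r"@\w+", "[USER]", text)          # mentions -> placeholder
    text = re.sub(r"#(\w+)", r"\1", text)           # '#word' -> 'word'
    return text

print(normalize_post("Loved the speech by @potus! #politics https://t.co/abc"))
# -> Loved the speech by [USER]! politics [URL]
```

If you go this route with a Hugging Face tokenizer, you’d presumably also want to register the placeholders as additional special tokens (e.g. via `tokenizer.add_tokens`) and resize the model’s embeddings, so they aren’t split into subwords.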


Since BERT works with a WordPiece tokenizer, I wouldn’t do any of that — see what happens first, before you put effort into pre-processing.
Since your texts aren’t really domain-specific, you may well see decent results for your “kind of sentiment analysis” without any pre-processing :wink:
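To make the WordPiece point concrete: unknown strings like URLs or handles don’t collapse into a single `[UNK]` token; the tokenizer greedily splits them into known subword pieces. Here’s a toy greedy longest-match implementation over a made-up mini-vocabulary (the real BERT vocab is ~30k entries, and the actual tokenizer also lowercases, strips accents, and splits on punctuation first):

```python
# Toy WordPiece-style tokenizer: greedy longest-match over a tiny vocab.
# Continuation pieces are prefixed with '##', as in BERT's real vocab.
VOCAB = {"http", "##s", "t", "##co", "user", "great", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub        # non-initial pieces get the '##' prefix
            if sub in vocab:
                match = sub
                break
            end -= 1                    # shrink the candidate and retry
        if match is None:
            return ["[UNK]"]            # no piece matched at all
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("https"))   # -> ['http', '##s']
print(wordpiece("xyz"))     # -> ['[UNK]']
```

With the real tokenizer you’d just call `AutoTokenizer.from_pretrained("distilbert-base-uncased").tokenize(text)` and inspect how your raw tweets come out before deciding whether any cleaning is worth it.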
