Preprocessing step for fine-tuning language model


I want to fine-tune the language model of BERT/DistilBERT (and later add a sequence classification head; the task is a kind of sentiment analysis). My data is a mix of tweets, other social media posts, and speeches.

Now, I’m wondering: which preprocessing steps are necessary?

I was thinking about removing URLs, hashtags, and user mentions. Is this necessary? Or should I replace them with a special token?
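For what it’s worth, the “replace with a special token” option is easy to sketch with regexes. This is just a minimal illustration, not a recommendation — the placeholder strings `[URL]` and `[USER]` are arbitrary names I made up, and hashtags are stripped to their bare word rather than replaced:

```python
import re

# Hypothetical normalizer: swap URLs and user mentions for placeholder
# tokens, and keep the word behind a hashtag while dropping the '#'.
def normalize_post(text):
    text = re.sub(r"https?://\S+", "[URL]", text)   # URLs -> placeholder
    text = re.sub(r"@\w+", "[USER]", text)          # mentions -> placeholder
    text = re.sub(r"#(\w+)", r"\1", text)           # '#word' -> 'word'
    return text

print(normalize_post("Loved the speech by @potus! #politics https://t.co/abc"))
# -> Loved the speech by [USER]! politics [URL]
```

If you go this route with a Hugging Face tokenizer, you’d presumably also want to register the placeholders as additional special tokens (e.g. via `tokenizer.add_tokens`) and resize the model’s embeddings, so they aren’t split into subwords.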


Since BERT works with a WordPiece tokenizer, I wouldn’t do any of that — see what happens first, before you put effort into pre-processing.
Since your texts aren’t really domain-specific, you may well see decent results for your “kind of sentiment analysis” without any pre-processing :wink:
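To make the WordPiece point concrete: unknown strings like URLs or handles don’t collapse into a single `[UNK]` token; the tokenizer greedily splits them into known subword pieces. Here’s a toy greedy longest-match implementation over a made-up mini-vocabulary (the real BERT vocab is ~30k entries, and the actual tokenizer also lowercases, strips accents, and splits on punctuation first):

```python
# Toy WordPiece-style tokenizer: greedy longest-match over a tiny vocab.
# Continuation pieces are prefixed with '##', as in BERT's real vocab.
VOCAB = {"http", "##s", "t", "##co", "user", "great", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub        # non-initial pieces get the '##' prefix
            if sub in vocab:
                match = sub
                break
            end -= 1                    # shrink the candidate and retry
        if match is None:
            return ["[UNK]"]            # no piece matched at all
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("https"))   # -> ['http', '##s']
print(wordpiece("xyz"))     # -> ['[UNK]']
```

With the real tokenizer you’d just call `AutoTokenizer.from_pretrained("distilbert-base-uncased").tokenize(text)` and inspect how your raw tweets come out before deciding whether any cleaning is worth it.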
