(How) should I pre-process my data for a transformer model used for classification (sentiment analysis)?

aandyrea · December 29, 2022, 5:36pm

I understand that textual data often needs to be pre-processed before running a model on it. However, I’m not sure which pre-processing steps are appropriate for a transformer model. For example, should I remove punctuation, lowercase everything, or lemmatize?

Info about my particular case, in case it’s relevant:
I have a dataset of tweets in multiple languages. I want to use a multi-lingual model such as XLM-R to classify tweets according to their sentiment.
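
For context, here is a minimal sketch of the kind of light-touch cleaning I have in mind, as opposed to heavy normalization. It rests on assumptions: XLM-R is a cased model with its own subword tokenizer, so case and punctuation are left alone and nothing is lemmatized. Only tweet-specific noise (user handles and links) is masked. The function name `normalize_tweet` and the placeholder strings `@user` and `http` are illustrative choices, not something the model requires.

```python
import re

# Hypothetical helper: masks tweet-specific noise only.
# Case, punctuation, emoji and word forms are kept as-is, on the assumption
# that the pretrained tokenizer was trained on raw, cased text.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # skip '@' inside e-mail addresses


def normalize_tweet(text: str) -> str:
    text = URL_RE.sub("http", text)       # replace links with a placeholder
    text = MENTION_RE.sub("@user", text)  # replace handles with a placeholder
    return " ".join(text.split())         # collapse runs of whitespace


tweets = [
    "@Bob  LOVE this!!! https://t.co/abc",
    "¡Qué   desastre, @maria_92!",
]
cleaned = [normalize_tweet(t) for t in tweets]
```

The cleaned strings would then go straight into the model's own tokenizer, with no lowercasing, punctuation stripping, or lemmatization in between.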
