I understand that textual data often needs to be pre-processed before running a model on it. However, I’m not sure which pre-processing steps are appropriate. For example, should I remove punctuation? Should I lowercase everything? Should I lemmatize?
Info about my particular case, in case it’s relevant:
I have a dataset of tweets in multiple languages. I want to use a multi-lingual model such as XLM-R to classify tweets according to their sentiment.
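To make the question concrete, here is a minimal sketch of the lightest cleaning I can think of: it only masks URLs and user mentions and collapses whitespace, leaving case, punctuation, and word forms untouched. The names `light_clean`, `URL_TOKEN`, and `USER_TOKEN` and the placeholder strings are my own assumptions for illustration, not part of XLM-R or any library. Whether I need more (or less) than this is exactly what I'm unsure about.

```python
import re

# Hypothetical placeholder tokens, chosen only for illustration.
URL_TOKEN = "http"
USER_TOKEN = "@user"

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare links
MENTION_RE = re.compile(r"@\w+")               # @handles (Unicode-aware)
WS_RE = re.compile(r"\s+")                     # runs of whitespace


def light_clean(tweet: str) -> str:
    """Mask URLs and mentions; keep case, punctuation, and hashtags as-is."""
    text = URL_RE.sub(URL_TOKEN, tweet)
    text = MENTION_RE.sub(USER_TOKEN, text)
    return WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(light_clean("@Alice  LOVED it!!! https://t.co/abc123 #happy"))
```

The cleaned string would then go straight to the model's own tokenizer, with no lowercasing, punctuation stripping, or lemmatization applied beforehand.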