Text preprocessing for fitting Tokenizer model

Hello :slightly_smiling_face: I have read that when preprocessing text it is best practice to remove stop words, special characters, and punctuation, so that you end up with only a list of words. My question is: if the original text I want my tokenizer to be fitted on contains a lot of statistics (hence many %, =, /, etc.) alongside regular prose, does it make sense to keep special characters and numbers as input to the tokenizer model? Or should they be removed in any case, since a tokenizer can only understand words?
Thanks a lot in advance :slightly_smiling_face:

I think this depends on the kind of task you want to handle. Briefly, in my opinion, you should keep those symbols in your vocabulary, because they carry meaning in your context, whatever downstream task you end up dealing with.
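For example, here is a minimal sketch assuming you are fitting the Keras `Tokenizer` (the question does not name a library, so this is just one common setup): its `filters` argument strips most punctuation by default, and you can pass a custom string that keeps the symbols you care about.

```python
# Minimal sketch (assumed setup: TensorFlow 2.x with the legacy
# keras.preprocessing Tokenizer; adapt if you use another library).
from tensorflow.keras.preprocessing.text import Tokenizer

# The Tokenizer's default `filters` string strips most punctuation.
# Rebuild it without the symbols that carry meaning in statistical text.
keep = "%=/"
default_filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
custom_filters = "".join(ch for ch in default_filters if ch not in keep)

tokenizer = Tokenizer(filters=custom_filters, lower=True)

texts = [
    "Revenue grew 12% in 2021: 3/4 of the targets were met, hit rate = 75%.",
    "Some plain narrative text without statistics.",
]
tokenizer.fit_on_texts(texts)

# Tokens such as '12%', '3/4', and '=' now survive in the vocabulary.
# Note: '.' is still filtered here, so decimals like 0.75 would lose
# their point; add '.' to `keep` if you need them intact.
print(tokenizer.word_index)
```

Whether this is the right trade-off depends on your downstream model; if you move to a subword tokenizer later, it will typically handle punctuation and numbers on its own without this kind of manual filtering.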
