Hello. I'm a beginner in NLP.
My question is: why does the xlm-roberta-base tokenizer split special symbols from the underscore-like marker ('▁')? For example,
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.tokenize(',')
>>> ['▁', ',']
tokenizer.tokenize('.')
>>> ['▁', '.']
tokenizer.tokenize('\\')
>>> ['▁', '\\']
I checked other special symbols and they are fine, e.g. ['▁!']. I don't understand why these particular symbols are split. Why does this happen, and how can I solve it?
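For context, here is a minimal sketch of what I understand so far: the '▁' (U+2581) seems to be SentencePiece's word-boundary marker, and joining the pieces back and replacing '▁' with a space appears to recover the original text, so the split may be lossless. The helper below is my own illustration, not part of the transformers API:

```python
# Hypothetical helper illustrating how SentencePiece's "▁" (U+2581) marker
# encodes word boundaries; this is NOT a transformers/sentencepiece function.
def pieces_to_text(pieces):
    # Concatenate the pieces, turn each "▁" into a space,
    # then trim the leading space added for the first word.
    return "".join(pieces).replace("\u2581", " ").lstrip()

print(pieces_to_text(["\u2581", ","]))            # ","
print(pieces_to_text(["\u2581Hello", "\u2581world"]))  # "Hello world"
```

So even for [',', '\\'] the original string seems recoverable; my question is whether this split is expected behavior or something I should work around.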