Hello. I’m a beginner in NLP.
My question is why the xlm-roberta-base tokenizer splits some special symbols from the underscore character (▁). For example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.tokenize(',')
>>> ['▁', ',']
tokenizer.tokenize('.')
>>> ['▁', '.']
tokenizer.tokenize('\\')
>>> ['▁', '\\']
I checked other special symbols and they are fine, e.g. ['▁!']. I don’t understand why these particular symbols are split. Could someone explain why this happens and how to solve it?
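In case it helps, here is a small check I put together (just a sketch, assuming the transformers package is installed and that get_vocab() is the right way to inspect the vocabulary) to see whether the merged piece such as '▁,' actually exists as a single token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# For each symbol, print how it is tokenized and whether the merged
# piece "▁" + symbol is present in the vocabulary.
vocab = tokenizer.get_vocab()
for symbol in [",", ".", "\\", "!"]:
    merged = "▁" + symbol
    print(symbol, tokenizer.tokenize(symbol), merged in vocab)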