Hello. I'm a beginner in NLP.
My question is: why does the xlm-roberta-base tokenizer split special symbols from the underscore-like marker ('▁')? For example,
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.tokenize(',')
>>> ['▁', ',']
tokenizer.tokenize('.')
>>> ['▁', '.']
tokenizer.tokenize('\\')
>>> ['▁', '\\']
I checked other special symbols and they are fine, e.g. ['▁!']. I don't understand why these particular symbols are split. Why does this happen, and how can I solve it?
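For context, here is a minimal sketch of what I understand so far: the '▁' (U+2581) seems to be SentencePiece's word-boundary marker, and joining the pieces back and replacing '▁' with a space appears to recover the original text, so the split may be lossless. The helper below is my own illustration, not part of the transformers API:

```python
# Hypothetical helper illustrating how SentencePiece's "▁" (U+2581) marker
# encodes word boundaries; this is NOT a transformers/sentencepiece function.
def pieces_to_text(pieces):
    # Concatenate the pieces, turn each "▁" into a space,
    # then trim the leading space added for the first word.
    return "".join(pieces).replace("\u2581", " ").lstrip()

print(pieces_to_text(["\u2581", ","]))            # ","
print(pieces_to_text(["\u2581Hello", "\u2581world"]))  # "Hello world"
```

So even for [',', '\\'] the original string seems recoverable; my question is whether this split is expected behavior or something I should work around.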