Hello. I’m a beginner in NLP.
My question is why the xlm-roberta-base tokenizer splits some special symbols from the underscore character (▁). For example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.tokenize(',')
>>> ['▁', ',']
tokenizer.tokenize('.')
>>> ['▁', '.']
tokenizer.tokenize('\\')
>>> ['▁', '\\']
I checked other special symbols and they are fine, e.g. ['▁!']. I don’t understand why these particular symbols are split. Could someone explain why this happens and how to solve it?
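In case it helps, here is a small check I put together (just a sketch, assuming the transformers package is installed and that get_vocab() is the right way to inspect the vocabulary) to see whether the merged piece such as '▁,' actually exists as a single token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# For each symbol, print how it is tokenized and whether the merged
# piece "▁" + symbol is present in the vocabulary.
vocab = tokenizer.get_vocab()
for symbol in [",", ".", "\\", "!"]:
    merged = "▁" + symbol
    print(symbol, tokenizer.tokenize(symbol), merged in vocab)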