In the official documentation of XLM-RoBERTa:

> Adapted from RobertaTokenizer and XLNetTokenizer. Based on SentencePiece.
And RobertaTokenizer and XLNetTokenizer are each described as follows:
> Constructs a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.

> Construct an XLNet tokenizer. Based on SentencePiece.
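
For example, the two produce visibly different subword markers on the same input (a minimal sketch, assuming the Hugging Face `transformers` package is installed and the `roberta-base` / `xlnet-base-cased` checkpoints are available):

```python
from transformers import RobertaTokenizer, XLNetTokenizer

# RoBERTa: byte-level BPE inherited from GPT-2; a leading space
# is encoded as the "Ġ" byte marker.
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")
print(roberta_tok.tokenize("Hello world"))  # e.g. ['Hello', 'Ġworld']

# XLNet: SentencePiece; a leading space is encoded as the "▁" marker.
xlnet_tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(xlnet_tok.tokenize("Hello world"))    # e.g. ['▁Hello', '▁world']
```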
And the XLM-RoBERTa paper describes its tokenization as follows:
> The different language-specific tokenization tools used by mBERT and XLM-100 make these models more difficult to use on raw text. Instead, we train a Sentence Piece model (SPM) and apply it directly on raw text data for all languages. We did not observe any loss in performance for models trained with SPM when compared to models trained with language-specific preprocessing and byte-pair encoding and hence use SPM for XLM-R.
Conneau, A., et al. "Unsupervised Cross-lingual Representation Learning at Scale." arXiv preprint arXiv:1911.02116 (2019).
I'm confused about what the XLM-RoBERTa tokenizer is actually built on. Is it a SentencePiece tokenizer? An XLNet tokenizer? Or am I misunderstanding something?
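
For what it's worth, inspecting the tokenizer in `transformers` seems to show it wraps a SentencePiece model directly (a sketch, assuming `transformers` and `sentencepiece` are installed; `sp_model` is the internal attribute of the slow tokenizer implementation):

```python
from transformers import XLMRobertaTokenizer

tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# The slow tokenizer holds a SentencePiece processor internally.
print(type(tok.sp_model))  # <class 'sentencepiece.SentencePieceProcessor'>

# The output pieces carry the SentencePiece "▁" whitespace marker,
# not RoBERTa's byte-level "Ġ" marker.
print(tok.tokenize("Hello world"))  # e.g. ['▁Hello', '▁world']
```

So it looks SentencePiece-based to me, but I'd like to understand how that squares with the "adapted from RobertaTokenizer and XLNetTokenizer" wording in the docs.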