I’ve trained a ByteLevelBPETokenizer, which output two files: vocab.json and merges.txt. I want to use this tokenizer with an XLNet model.
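For context, the training looked roughly like this (the corpus file name and the tiny inline corpus are placeholders for my actual setup):

```python
from tokenizers import ByteLevelBPETokenizer

# stand-in corpus file; the real run used a full text corpus
with open("corpus.txt", "w") as f:
    f.write("Hello world!\nByte-level BPE can encode any text.\n")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=1000, min_frequency=1)

# writes vocab.json and merges.txt into the current directory
tokenizer.save_model(".")
```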
When I tried to load these files into an XLNetTokenizer, I ran into an issue: the XLNetTokenizer expects its vocab file to be a SentencePiece model:
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}
I found this confusing, since it’s possible to train a ByteLevelBPETokenizer and then load it into a RobertaTokenizerFast:
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
I looked at the source for RobertaTokenizer, and the expected vocab files match the output of the ByteLevelBPETokenizer:
VOCAB_FILES_NAMES = {
"vocab_file": "vocab.json",
"merges_file": "merges.txt",
}
Why do these tokenizers expect different vocab file formats? Are the two tokenizer classes intended to be used in different ways, or is there a way to convert the BPE output for use with XLNet?