Why do different tokenizers use different vocab files?

I’ve trained a ByteLevelBPETokenizer, which produced two files: vocab.json and merges.txt. I want to use this tokenizer with an XLNet model.
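
For context, the training step was roughly the following (the corpus path, vocab size, and special tokens here are placeholders, not my exact settings):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a plain-text corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Saving writes vocab.json and merges.txt into the output directory
tokenizer.save_model("./EsperBERTo")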

When I tried to load this into an XLNetTokenizer, I ran into an issue. The XLNetTokenizer expects the vocab file to be a SentencePiece model:

VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}

I found this confusing, since it’s possible to train a ByteLevelBPETokenizer and then load it into a RobertaTokenizerFast:

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

I looked at the source for the RobertaTokenizer, and the expected vocab files match the output of the ByteLevelBPETokenizer:

VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}

Why do these tokenizers expect different vocab file formats? Are they intended to be used in different ways?
