I’ve trained a ByteLevelBPETokenizer, which output two files: vocab.json and merges.txt. I want to use this tokenizer with an XLNet model.
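For context, the training looked roughly like this (the corpus file name and the tiny inline corpus are placeholders for my actual setup):

```python
from tokenizers import ByteLevelBPETokenizer

# stand-in corpus file; the real run used a full text corpus
with open("corpus.txt", "w") as f:
    f.write("Hello world!\nByte-level BPE can encode any text.\n")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=1000, min_frequency=1)

# writes vocab.json and merges.txt into the current directory
tokenizer.save_model(".")
```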
When I tried to load these files into an XLNetTokenizer, I ran into an issue: the XLNetTokenizer expects its vocab file to be a SentencePiece model:
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}
I found this confusing, since it’s possible to train a ByteLevelBPETokenizer and then load it into a RobertaTokenizerFast:
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
I looked at the source for RobertaTokenizer, and the expected vocab files match the output of the ByteLevelBPETokenizer:
VOCAB_FILES_NAMES = {
"vocab_file": "vocab.json",
"merges_file": "merges.txt",
}
Why do these tokenizers expect different vocab file formats? Are the two tokenizer classes intended to be used in different ways, or is there a way to convert the BPE output for use with XLNet?