I have a special non-language use case with a fixed vocabulary—i.e., a relatively small set of generated tokens that represents the entire vocabulary of our “language.” I’d like to be able to use this with any of the different models, and I’m wondering what the best approach would be. All I have is a vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?
Any help much appreciated.
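For what it’s worth, here is one way I’ve seen this done with the standalone tokenizers library: a word-level model built directly from a fixed vocabulary, with no training and no BPE merges. A minimal sketch—the token strings and special tokens below are made up for illustration; in practice you’d read them from your vocab.txt:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Map each fixed token string to an id; in practice, read these from
# vocab.txt (one token per line, line number = id).
vocab = {"[UNK]": 0, "[PAD]": 1, "aa": 2, "bb": 3, "cc": 4}

# WordLevel does exact string lookup -- no subword splitting at all.
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split raw input on whitespace first

enc = tokenizer.encode("aa bb zz")
print(enc.tokens)  # the out-of-vocabulary "zz" falls back to [UNK]
print(enc.ids)
```

The resulting Tokenizer can then be wrapped in transformers.PreTrainedTokenizerFast(tokenizer_object=...) so it plugs into a model trained from scratch on that vocabulary.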
What if I just instantiated a tokenizer (e.g., BigBirdTokenizer) and then used add_tokens() to add my entire vocabulary? That is, start with nothing, then “force” the tokens in with the add_tokens() function… ??
UPDATE: Hmm… no… I can’t instantiate without a vocab file, so that won’t work…
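One caveat worth noting (hedging, since I haven’t tried every class): SentencePiece-based tokenizers like BigBirdTokenizer expect a binary model file, not a plain vocab.txt, but the WordPiece-based BertTokenizer can be instantiated from one, since its vocab file is exactly a one-token-per-line text file. A sketch with a made-up toy vocabulary:

```python
from pathlib import Path
from transformers import BertTokenizer

# Write a toy vocab.txt: one token per line, line number = token id.
# The special tokens are the BERT defaults; "aa"/"bb" stand in for
# the real fixed vocabulary.
Path("toy_vocab").mkdir(exist_ok=True)
Path("toy_vocab/vocab.txt").write_text(
    "\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "aa", "bb"])
)

tok = BertTokenizer("toy_vocab/vocab.txt")
print(tok.tokenize("aa bb"))
print(tok.convert_tokens_to_ids(["aa", "bb"]))
```

Calling tok.save_pretrained("toy_vocab") afterwards also writes a tokenizer_config.json next to the vocab, which gives you a loadable tokenizer folder.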
In case anyone is having a similar problem (though I guess nobody is), I was able to get it loading by creating a folder containing a tokenizer_config.json along with my vocab.txt, and passing that folder to the run_mlm.py script, using the