Using a fixed vocabulary?

I have a special non-language use case with a fixed vocabulary—i.e., a relatively small set of generated tokens that represents the entire vocabulary of our “language.” I’d like to be able to use this with any of the different models, and I’m wondering what the best approach would be. It’s just a vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?

Any help much appreciated.

What if I just instantiate a tokenizer (e.g., BigBirdTokenizer), then use add_tokens() to add my entire vocabulary? That is, start with nothing, then “force” the tokens in with the add_tokens() function?

UPDATE: Hmm… no. I can’t instantiate a tokenizer without a vocab file, so that won’t work…

In case anyone is having a similar problem (though I guess nobody is), I was able to get it loading by creating a folder containing special_tokens_map.json and tokenizer_config.json along with my vocab.txt, and passing that folder to the run_mlm.py script via the --tokenizer_name argument.
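
For reference, here is roughly what that looks like as a script. This is only a sketch: the folder name, the BERT-style special tokens, and the BertTokenizer class in tokenizer_config.json are assumptions, so swap in whatever matches your own vocab.txt.

```python
import json
from pathlib import Path
from transformers import AutoTokenizer

tok_dir = Path("my_tokenizer")  # hypothetical folder name
tok_dir.mkdir(exist_ok=True)

# special_tokens_map.json: assuming BERT-style special tokens;
# each of these strings must also appear as a line in vocab.txt
special_tokens = {
    "unk_token": "[UNK]",
    "sep_token": "[SEP]",
    "pad_token": "[PAD]",
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
}
(tok_dir / "special_tokens_map.json").write_text(json.dumps(special_tokens, indent=2))

# tokenizer_config.json: "tokenizer_class" tells AutoTokenizer which class to build
tokenizer_config = {"tokenizer_class": "BertTokenizer", "do_lower_case": False}
(tok_dir / "tokenizer_config.json").write_text(json.dumps(tokenizer_config, indent=2))

# put vocab.txt (one token per line) in the same folder, then this should load:
tokenizer = AutoTokenizer.from_pretrained(str(tok_dir))
print(tokenizer.tokenize("a short test string made of your tokens"))
```

If that loads cleanly, the same folder path is what gets passed to run_mlm.py as --tokenizer_name.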
