Using a fixed vocabulary?

I have a special non-language use case with a fixed vocabulary—i.e., a relatively small set of generated tokens that represents the entire vocabulary of our “language.” I’d like to be able to use it with any of the models, and I’m wondering what the best approach would be. It’s just a vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?
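For context, the behavior I’m after is just an exact-match lookup: each whole token in the file maps to one id, with no subword merges. A minimal stand-alone sketch (the token strings and unknown-token handling here are hypothetical placeholders, not my real vocab):

```python
def load_vocab(lines):
    """Map each vocab entry to an integer id, in file order."""
    return {tok: i for i, tok in enumerate(lines)}

def encode(text, vocab, unk_id=0):
    """Whitespace-split and look up each token; unknowns map to unk_id."""
    return [vocab.get(tok, unk_id) for tok in text.split()]

# Stand-in for reading vocab.txt; "[UNK]" reserved as id 0.
vocab = load_vocab(["[UNK]", "AA", "BB", "CC"])
print(encode("AA CC ZZ", vocab))  # [1, 3, 0]
```

This is exactly what a word-level (non-BPE) tokenizer does, which is why I suspect the BPE tokenizers are the wrong fit.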

Any help much appreciated.

What if I just instantiate a tokenizer (e.g., BigBirdTokenizer), then use add_tokens() to add my entire vocabulary? That is, start with nothing and “force” the tokens in with add_tokens()?

UPDATE: Hmm… no… I can’t instantiate a tokenizer without a vocab file, so that won’t work…
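One route I’m now considering, in case it helps anyone reading: building a word-level tokenizer directly from the vocab with the `tokenizers` library’s WordLevel model, instead of forcing tokens into a BPE tokenizer. A minimal sketch, assuming a whitespace-separated token stream and placeholder token names (not my real vocab):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Stand-in for the contents of vocab.txt; "[UNK]" reserved as id 0.
vocab = {"[UNK]": 0, "AA": 1, "BB": 2, "CC": 3}

# WordLevel does exact-match lookup only — no BPE merges.
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split input on whitespace

enc = tokenizer.encode("AA CC ZZ")
print(enc.ids)  # [1, 3, 0] — "ZZ" falls back to [UNK]
```

I believe a Tokenizer built this way can be wrapped for use with `transformers` models via PreTrainedTokenizerFast, but I haven’t verified that end to end.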