I have a special non-language use case with a fixed vocabulary—i.e., a relatively small set of generated tokens that represents the entire vocabulary of our “language.” I’d like to be able to use this vocabulary with any of the available models, and I’m wondering what the best approach would be. It’s just a
vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?
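For context, here's a minimal sketch of the kind of mapping I have in mind — plain Python with placeholder tokens, where each line of vocab.txt is one token and encoding is a direct lookup rather than any subword merging:

```python
# Stand-in for the contents of vocab.txt (real tokens are short generated strings)
vocab_lines = ["tokA", "tokB", "tokC"]
vocab = {tok: i for i, tok in enumerate(vocab_lines)}

def encode(text: str) -> list[int]:
    # Word-level lookup: every whitespace-separated piece must be in the vocab
    return [vocab[tok] for tok in text.split()]

print(encode("tokA tokC tokB"))  # [0, 2, 1]
```

Essentially I want a tokenizer that behaves like this dict lookup but plugs into the standard model/tokenizer interfaces.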
Any help much appreciated.