Using a fixed vocabulary?

I have a special non-language use case with a fixed vocabulary—i.e., a relatively small set of generated tokens that represents the entire vocabulary of our “language.” I’d like to be able to use this with any of the different models, and I’m wondering what the best approach would be. It’s just a vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?

Any help much appreciated.

What if I just instantiate a tokenizer (e.g., BigBirdTokenizer), then use add_tokens() to add my entire vocabulary? That is, start with nothing, then “force” the tokens in with the add_tokens() function?

UPDATE: Hmm… no. I can’t instantiate a tokenizer without a vocab file, so that won’t work…

In case anyone is having a similar problem (though I guess nobody is), I was able to get it loading by creating a folder containing special_tokens_map.json and tokenizer_config.json along with my vocab.txt, and passing that folder to the run_mlm.py script via the --tokenizer_name argument.
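
For reference, here is roughly what that looks like as a script. This is only a sketch: the folder name, the BERT-style special tokens, and the BertTokenizer class in tokenizer_config.json are assumptions, so swap in whatever matches your own vocab.txt.

```python
import json
from pathlib import Path
from transformers import AutoTokenizer

tok_dir = Path("my_tokenizer")  # hypothetical folder name
tok_dir.mkdir(exist_ok=True)

# special_tokens_map.json: assuming BERT-style special tokens;
# each of these strings must also appear as a line in vocab.txt
special_tokens = {
    "unk_token": "[UNK]",
    "sep_token": "[SEP]",
    "pad_token": "[PAD]",
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
}
(tok_dir / "special_tokens_map.json").write_text(json.dumps(special_tokens, indent=2))

# tokenizer_config.json: "tokenizer_class" tells AutoTokenizer which class to build
tokenizer_config = {"tokenizer_class": "BertTokenizer", "do_lower_case": False}
(tok_dir / "tokenizer_config.json").write_text(json.dumps(tokenizer_config, indent=2))

# put vocab.txt (one token per line) in the same folder, then this should load:
tokenizer = AutoTokenizer.from_pretrained(str(tok_dir))
print(tokenizer.tokenize("a short test string made of your tokens"))
```

If that loads cleanly, the same folder path is what gets passed to run_mlm.py as --tokenizer_name.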
