How to create a Hugging Face compatible tokenizer from a vocab file?

So far I have studied that we use BPE trainers or SentencePiece models to create tokenizers. In short, these models learn tokenization patterns from our corpus and create a vocab file.
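
For example, the trainer-based flow I mean looks roughly like this (a rough sketch with the 🤗 tokenizers library; the file names are placeholders):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Learn a vocab from a corpus; "corpus.txt" is a placeholder path.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")  # the learned vocab ends up in this file
```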

But I have created a vocab file which is simply a mapping from str to int, e.g. { "a": 1, "b": 2, ... }. I want to create a tokenizer which loads this mapping file and handles encoding and decoding. Is there a way to achieve this?


The tokenization models have the option to be initialized with a vocab.json file or a dictionary. So you can, for example, use

import tokenizers

model = tokenizers.models.WordPiece.from_file('/path/to/vocab.json')
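
To go from the plain str to int mapping in your question to a tokenizer that encodes and decodes, a minimal sketch could look like this (the "[UNK]" token and the example vocab here are assumptions; the unknown token has to be present in the mapping):

```python
import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Assumed example mapping like the one in the question; alternatively,
# load your mapping file directly:
# with open("/path/to/vocab.json") as f:
#     vocab = json.load(f)
vocab = {"[UNK]": 0, "a": 1, "b": 2}

tokenizer = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split input on whitespace first
tokenizer.decoder = decoders.WordPiece()               # joins tokens back together on decode

encoding = tokenizer.encode("a b")
print(encoding.ids)                    # [1, 2]
print(tokenizer.decode(encoding.ids))  # "a b"
```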

There are different models, and the choice of model also influences how your input is split up into tokens. See Models.
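
For instance, with a plain word-to-id mapping, a WordLevel model only does exact whole-word lookups, while WordPiece greedily matches subword pieces. A small illustration (the vocab here is made up):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

vocab = {"[UNK]": 0, "a": 1, "##b": 2}

word_level = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
word_level.pre_tokenizer = pre_tokenizers.Whitespace()
print(word_level.encode("ab").tokens)  # ['[UNK]']: whole-word lookup only

word_piece = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))
word_piece.pre_tokenizer = pre_tokenizers.Whitespace()
print(word_piece.encode("ab").tokens)  # ['a', '##b']: greedy subword matching
```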

Other than that, see Building a tokenizer, block by block - Hugging Face LLM Course
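
That chapter also shows how to wrap the resulting tokenizers.Tokenizer in a transformers PreTrainedTokenizerFast so it behaves like any other Hugging Face tokenizer. A minimal sketch (the vocab and "[UNK]" token are again assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(models.WordPiece({"[UNK]": 0, "a": 1, "b": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Wrap it so it can be used (and saved) like any transformers tokenizer.
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]")
print(hf_tokenizer("a b")["input_ids"])  # [1, 2]
```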
