So far I have learned that we use BPE trainers or SentencePiece models to create tokenizers. In short, these models learn tokenization patterns from our corpus and create a vocab file.
But I have created a vocab file myself, which is simply a mapping from str to int, e.g. { "a": 1, "b": 2, ... }. I want to create a tokenizer that loads this mapping file and handles encoding and decoding. Is there a way to achieve this?
The tokenization models have the option to be initialized with a vocab.json file or a dictionary. So you can, for example, use:
import tokenizers
model = tokenizers.models.WordPiece.from_file('/path/to/vocab.json')
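Since your vocab is a plain str-to-int mapping, a WordLevel model may be the closest fit. Here is a minimal sketch, not the only way to do it: the vocab contents, the [UNK] token, and the whitespace pre-tokenizer are assumptions you would adapt to your own file.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical mapping; in practice load yours with json.load from the vocab file
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# Build a tokenizer around the mapping; [UNK] catches out-of-vocab tokens
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split input on whitespace/punctuation

enc = tokenizer.encode("hello world")
print(enc.tokens)                 # ['hello', 'world']
print(enc.ids)                    # [1, 2]
print(tokenizer.decode(enc.ids))  # 'hello world'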
There are different models, and the choice of model also influences how your input is split into tokens. See Models.
Other than that, see Building a tokenizer, block by block - Hugging Face LLM Course.
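Once assembled, the whole tokenizer (model, pre-tokenizer, and any other components) can be saved to a single JSON file and reloaded later; the file name here is just an example.

from tokenizers import Tokenizer

tokenizer.save("my-tokenizer.json")            # serialize everything to one file
reloaded = Tokenizer.from_file("my-tokenizer.json")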