Here I created a new tokenizer:
te_tokenizer = LlamaTokenizer.from_pretrained('/content/te_tokenizer')
Now I want to merge it with the Llama tokenizer, so:
llama_tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
I combined these two tokenizers, saved the result locally, and loaded it back with LlamaTokenizer:
new_te_tokenizer = LlamaTokenizer.from_pretrained('/content/extended_tokenizer')
Now I saved the tokenizer using .save_pretrained():
new_te_tokenizer.save_pretrained('telugu_llama2_tokenizer')
Then I pushed it to the Hub. You can visit the tokenizer here:
The thing is, when I called .save_pretrained(), it only produced tokenizer_config.json, tokenizer.model, and special_tokens_map.json. But I also want the tokenizer.json file, which I can see in the original Llama tokenizer repo.
And let's say I want to load my tokenizer with the Hugging Face tokenizers library. That library requires a tokenizer.json file.
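To illustrate that requirement: the standalone tokenizers library serializes everything into a single tokenizer.json and loads only from that format via Tokenizer.from_file(). A tiny round-trip demo, assuming the tokenizers package is installed; the empty-vocab BPE tokenizer here is just a placeholder to show the file format:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Build a minimal (empty-vocab) tokenizer just to demonstrate the format.
tok = Tokenizer(BPE(unk_token="<unk>"))

path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)  # writes the single-file tokenizer.json format

# Loading works only from a tokenizer.json file, not from tokenizer.model.
reloaded = Tokenizer.from_file(path)
print(type(reloaded).__name__)
```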
So the question is: how do I get the tokenizer.json for my new merged tokenizer?