I created a tokenizer and merged it with the Llama 2 tokenizer, growing the vocab size from 32k to 38k. I successfully saved the merged tokenizer using save_pretrained() and pushed it to the Hub, and I'm also able to pull it with AutoTokenizer and use it. But on the Hub I do not see any tokenizer.json file: save_pretrained() only generated tokenizer_config.json, tokenizer.model, and special_tokens_map.json.
Here I loaded the new Telugu tokenizer I created:

from transformers import LlamaTokenizer

te_tokenizer = LlamaTokenizer.from_pretrained('/content/te_tokenizer')
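For context, roughly how the new tokenizer was trained with sentencepiece (the corpus path, vocab size, and model type below are just placeholders for a sketch, not my exact settings):

import sentencepiece as spm

# Hypothetical training call; the output directory must already exist.
# model_prefix='.../tokenizer' produces tokenizer.model and tokenizer.vocab.
spm.SentencePieceTrainer.train(
    input='telugu_corpus.txt',
    model_prefix='/content/te_tokenizer/tokenizer',
    vocab_size=6000,
    model_type='bpe',
)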
Now I want to merge it with the Llama 2 tokenizer, so I loaded that as well:

llama_tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
I combined these two tokenizers (a sketch of the merge is below), saved the result locally, and loaded it back with LlamaTokenizer:

new_te_tokenizer = LlamaTokenizer.from_pretrained('/content/extended_tokenizer')
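The merge itself can be done at the sentencepiece protobuf level. A minimal sketch of that approach (the score handling is simplified here, and the output path is a placeholder):

import os
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Parse both underlying sentencepiece models; slow SentencePiece-based
# tokenizers in transformers expose the .model path as vocab_file
llama_proto = sp_pb2.ModelProto()
with open(llama_tokenizer.vocab_file, 'rb') as f:
    llama_proto.ParseFromString(f.read())

te_proto = sp_pb2.ModelProto()
with open(te_tokenizer.vocab_file, 'rb') as f:
    te_proto.ParseFromString(f.read())

# Append every Telugu piece that Llama 2 does not already have
existing_pieces = {p.piece for p in llama_proto.pieces}
for piece in te_proto.pieces:
    if piece.piece not in existing_pieces:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

os.makedirs('/content/extended_tokenizer', exist_ok=True)
with open('/content/extended_tokenizer/tokenizer.model', 'wb') as f:
    f.write(llama_proto.SerializeToString())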
Then I saved the tokenizer using .save_pretrained():

new_te_tokenizer.save_pretrained('telugu_llama2_tokenizer')
Then I pushed it to the Hub. You can visit the tokenizer here:
The thing is, when I called .save_pretrained(), it only produced tokenizer_config.json, tokenizer.model, and special_tokens_map.json. But I want the tokenizer.json file, which I can see in the original Llama 2 tokenizer repo.
And suppose I want to load my tokenizer with the Hugging Face tokenizers library; that library requires a tokenizer.json file. So the question is: how do I get a tokenizer.json for my new merged tokenizer?
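For reference, this is how I'd load it with the tokenizers library once the file exists:

from tokenizers import Tokenizer

# Tokenizer.from_file() only accepts a tokenizer.json, not tokenizer.model
tok = Tokenizer.from_file('telugu_llama2_tokenizer/tokenizer.json')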
These are the relevant constants from transformers (tokenization_utils_base.py):

import re

# Slow tokenizers used to be saved in three separate files
SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
ADDED_TOKENS_FILE = "added_tokens.json"
TOKENIZER_CONFIG_FILE = "tokenizer_config.json"

# Fast tokenizers (provided by the Hugging Face tokenizers library) can be saved in a single file
FULL_TOKENIZER_FILE = "tokenizer.json"
_re_tokenizer_file = re.compile(r"tokenizer\.(.*)\.json")
Can you please check whether your new tokenizer is fast or not? tokenizer.is_fast should return True.
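If it returns False, one way to produce a tokenizer.json is to load the saved slow-tokenizer files with the fast class and re-save. A minimal sketch, assuming the conversion succeeds (it needs sentencepiece and protobuf installed):

from transformers import LlamaTokenizerFast

# Loading slow-tokenizer files with the fast class converts them on the fly
fast_tokenizer = LlamaTokenizerFast.from_pretrained('telugu_llama2_tokenizer')
print(fast_tokenizer.is_fast)  # True

# Saving a fast tokenizer also writes tokenizer.json
fast_tokenizer.save_pretrained('telugu_llama2_tokenizer')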