I created a tokenizer and merged it with the Llama 2 tokenizer, growing the vocab size from 32k to 38k. I successfully saved the merged tokenizer using save_pretrained() and pushed it to the Hub, and I'm also able to pull it with AutoTokenizer and use it. But on the Hub I do not see any tokenizer.json file: save_pretrained() only generated tokenizer_config.json, tokenizer.model, and special_tokens_map.json.
Here I loaded the new Telugu tokenizer I created:

from transformers import LlamaTokenizer

te_tokenizer = LlamaTokenizer.from_pretrained('/content/te_tokenizer')
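For context, roughly how the new tokenizer was trained with sentencepiece (the corpus path, vocab size, and model type below are just placeholders for a sketch, not my exact settings):

import sentencepiece as spm

# Hypothetical training call; the output directory must already exist.
# model_prefix='.../tokenizer' produces tokenizer.model and tokenizer.vocab.
spm.SentencePieceTrainer.train(
    input='telugu_corpus.txt',
    model_prefix='/content/te_tokenizer/tokenizer',
    vocab_size=6000,
    model_type='bpe',
)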
Now I want to merge it with the Llama 2 tokenizer, so I loaded that as well:

llama_tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
I combined these two tokenizers (a sketch of the merge is below), saved the result locally, and loaded it back with LlamaTokenizer:

new_te_tokenizer = LlamaTokenizer.from_pretrained('/content/extended_tokenizer')
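The merge itself can be done at the sentencepiece protobuf level. A minimal sketch of that approach (the score handling is simplified here, and the output path is a placeholder):

import os
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Parse both underlying sentencepiece models; slow SentencePiece-based
# tokenizers in transformers expose the .model path as vocab_file
llama_proto = sp_pb2.ModelProto()
with open(llama_tokenizer.vocab_file, 'rb') as f:
    llama_proto.ParseFromString(f.read())

te_proto = sp_pb2.ModelProto()
with open(te_tokenizer.vocab_file, 'rb') as f:
    te_proto.ParseFromString(f.read())

# Append every Telugu piece that Llama 2 does not already have
existing_pieces = {p.piece for p in llama_proto.pieces}
for piece in te_proto.pieces:
    if piece.piece not in existing_pieces:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

os.makedirs('/content/extended_tokenizer', exist_ok=True)
with open('/content/extended_tokenizer/tokenizer.model', 'wb') as f:
    f.write(llama_proto.SerializeToString())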
Then I saved the tokenizer using .save_pretrained():

new_te_tokenizer.save_pretrained('telugu_llama2_tokenizer')
Then I pushed it to the Hub. You can visit the tokenizer here:
The thing is, when I called .save_pretrained(), it only produced tokenizer_config.json, tokenizer.model, and special_tokens_map.json. But I want the tokenizer.json file, which I can see in the original Llama 2 tokenizer repo.
And suppose I want to load my tokenizer with the Hugging Face tokenizers library; that library requires a tokenizer.json file. So the question is: how do I get a tokenizer.json for my new merged tokenizer?
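For reference, this is how I'd load it with the tokenizers library once the file exists:

from tokenizers import Tokenizer

# Tokenizer.from_file() only accepts a tokenizer.json, not tokenizer.model
tok = Tokenizer.from_file('telugu_llama2_tokenizer/tokenizer.json')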
These are the relevant constants from transformers (tokenization_utils_base.py):

import re

# Slow tokenizers used to be saved in three separate files
SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
ADDED_TOKENS_FILE = "added_tokens.json"
TOKENIZER_CONFIG_FILE = "tokenizer_config.json"

# Fast tokenizers (provided by the Hugging Face tokenizers library) can be saved in a single file
FULL_TOKENIZER_FILE = "tokenizer.json"
_re_tokenizer_file = re.compile(r"tokenizer\.(.*)\.json")
Can you please check whether your new tokenizer is fast or not? tokenizer.is_fast should return True.
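If it returns False, one way to produce a tokenizer.json is to load the saved slow-tokenizer files with the fast class and re-save. A minimal sketch, assuming the conversion succeeds (it needs sentencepiece and protobuf installed):

from transformers import LlamaTokenizerFast

# Loading slow-tokenizer files with the fast class converts them on the fly
fast_tokenizer = LlamaTokenizerFast.from_pretrained('telugu_llama2_tokenizer')
print(fast_tokenizer.is_fast)  # True

# Saving a fast tokenizer also writes tokenizer.json
fast_tokenizer.save_pretrained('telugu_llama2_tokenizer')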