Here I created a new tokenizer:
te_tokenizer = LlamaTokenizer.from_pretrained('/content/te_tokenizer')
Now I want to merge it with the Llama tokenizer, so:
llama_tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
I combined these two tokenizers, saved the result locally, and loaded it back with LlamaTokenizer:
new_te_tokenizer = LlamaTokenizer.from_pretrained('/content/extended_tokenizer')
Now I saved the tokenizer using .save_pretrained():
new_te_tokenizer.save_pretrained('telugu_llama2_tokenizer')
Then I pushed it to the Hub. You can visit the tokenizer here:
The thing is, when I called .save_pretrained(), it only produced tokenizer_config.json, tokenizer.model, and special_tokens_map.json. But I also want the tokenizer.json file, which I can see in the original Llama tokenizer repo.
And let's say I want to load my tokenizer with the Hugging Face tokenizers library. That library requires a tokenizer.json file.
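To illustrate that requirement: the standalone tokenizers library serializes everything into a single tokenizer.json and loads only from that format via Tokenizer.from_file(). A tiny round-trip demo, assuming the tokenizers package is installed; the empty-vocab BPE tokenizer here is just a placeholder to show the file format:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Build a minimal (empty-vocab) tokenizer just to demonstrate the format.
tok = Tokenizer(BPE(unk_token="<unk>"))

path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)  # writes the single-file tokenizer.json format

# Loading works only from a tokenizer.json file, not from tokenizer.model.
reloaded = Tokenizer.from_file(path)
print(type(reloaded).__name__)
```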
So the question is: how do I get the tokenizer.json for my new merged tokenizer?