Loading SentencePiece tokenizer

When I use SentencePieceTrainer.train(), it produces a .model and a .vocab file. However, when I try to load the tokenizer with AutoTokenizer.from_pretrained(), it expects a .json file. How would I get a .json file from the .model and .vocab files?

Did you save the model first? You can do that using the save_pretrained() function, and then simply load the tokenizer by providing the model’s directory (where all the necessary files have been stored) to the from_pretrained() function.
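For example, here is a minimal round trip with a toy tokenizer built in memory (the word-level vocab and directory name are illustrative, not part of the original question):

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Build a tiny word-level tokenizer purely in memory.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
core = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
core.pre_tokenizer = pre_tokenizers.Whitespace()
tok = PreTrainedTokenizerFast(tokenizer_object=core, unk_token="[UNK]")

# save_pretrained() writes tokenizer.json, tokenizer_config.json, etc. ...
tok.save_pretrained("my_tokenizer")
# ... and from_pretrained() rebuilds the tokenizer from that directory.
reloaded = AutoTokenizer.from_pretrained("my_tokenizer")
print(reloaded.tokenize("hello world"))
```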

@mapama247 I am actually stuck here.

I tried loading the SentencePiece-trained tokenizer with the following script:

import tokenizers

tok = tokenizers.SentencePieceUnigramTokenizer.from_spm("tokenizer.model")
tok.save_pretrained("hf_format_tokenizer")

I get the following error:

AttributeError: 'SentencePieceUnigramTokenizer' object has no attribute 'save_pretrained'

Hi @StephennFernandes,
This is because in my previous message I was talking about the AutoTokenizer class from transformers. I can't reproduce your problem since you're loading from a local file, but I see that the SentencePieceUnigramTokenizer you're using has a couple of methods you could try instead: save and save_model. You can list them with dir(tokenizers.SentencePieceUnigramTokenizer).