When I use SentencePieceTrainer.train(), it produces a .model and a .vocab file. However, when I try to load it with AutoTokenizer.from_pretrained(), it expects a .json file. How would I get a .json file from the .model and .vocab files?
Did you save the model first? You can do that using the save_pretrained() function, and then simply load the tokenizer by providing the model’s directory (where all the necessary files have been stored) to the from_pretrained() function.
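Something along these lines should work (just a sketch, untested; T5Tokenizer is only one example of a slow tokenizer class that can wrap a raw SentencePiece .model file, and it adds its own special tokens such as the extra sentinel ids, so pick whichever class matches your setup — the file and directory names below are placeholders):

```python
from transformers import AutoTokenizer, T5Tokenizer

# Wrap the raw SentencePiece .model file in a slow transformers tokenizer class.
slow_tok = T5Tokenizer(vocab_file="tokenizer.model")

# save_pretrained() writes the SentencePiece file plus the tokenizer configs to a directory.
slow_tok.save_pretrained("hf_format_tokenizer")

# from_pretrained() can then load the tokenizer from that directory; the fast
# variant converts the SentencePiece model and can serialize it as tokenizer.json.
tok = AutoTokenizer.from_pretrained("hf_format_tokenizer")
```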
@mapama247 I am actually stuck here.
I tried loading the SentencePiece-trained tokenizer using the following script:
```python
import tokenizers

tok = tokenizers.SentencePieceUnigramTokenizer.from_spm("tokenizer.model")
tok.save_pretrained("hf_format_tokenizer")
```
I get the following error:
AttributeError: 'SentencePieceUnigramTokenizer' object has no attribute 'save_pretrained'
Hi @StephennFernandes,
This is because in the previous message I was talking about the AutoTokenizer class from transformers. I cannot reproduce your problem because you're loading from a local file, but I see that the SentencePieceUnigramTokenizer you're using has a couple of methods you could try: save and save_model. You can list them with dir(tokenizers.SentencePieceUnigramTokenizer).
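For example, something like this should give you a tokenizer.json that AutoTokenizer can load (untested sketch; the file names and the PreTrainedTokenizerFast wrapper are just one way to do it, and depending on your transformers version you may need to re-add your special tokens on the wrapped tokenizer):

```python
from tokenizers import SentencePieceUnigramTokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

tok = SentencePieceUnigramTokenizer.from_spm("tokenizer.model")
tok.save("tokenizer.json")  # serializes the whole tokenizer to a single JSON file

# Wrap the JSON file as a transformers fast tokenizer and save it in HF format.
hf_tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
hf_tok.save_pretrained("hf_format_tokenizer")

# AutoTokenizer can now load it from that directory.
tok2 = AutoTokenizer.from_pretrained("hf_format_tokenizer")
```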