When I use SentencePieceTrainer.train(), it produces a .model and a .vocab file. However, when I try to load it with AutoTokenizer.from_pretrained(), it expects a .json file. How would I get a .json file from the .model and .vocab files?
Did you save the model first? You can do that using the save_pretrained() function, and then simply load the tokenizer by providing the model’s directory (where all the necessary files have been stored) to the from_pretrained() function.
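Something along these lines should work (just a sketch, untested; T5Tokenizer is only one example of a slow tokenizer class that can wrap a raw SentencePiece .model file, and it adds its own special tokens such as the extra sentinel ids, so pick whichever class matches your setup — the file and directory names below are placeholders):

```python
from transformers import AutoTokenizer, T5Tokenizer

# Wrap the raw SentencePiece .model file in a slow transformers tokenizer class.
slow_tok = T5Tokenizer(vocab_file="tokenizer.model")

# save_pretrained() writes the SentencePiece file plus the tokenizer configs to a directory.
slow_tok.save_pretrained("hf_format_tokenizer")

# from_pretrained() can then load the tokenizer from that directory; the fast
# variant converts the SentencePiece model and can serialize it as tokenizer.json.
tok = AutoTokenizer.from_pretrained("hf_format_tokenizer")
```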
@mapama247 I am actually stuck here.
I tried loading the SentencePiece-trained tokenizer using the following script:
```python
import tokenizers

tok = tokenizers.SentencePieceUnigramTokenizer.from_spm("tokenizer.model")
tok.save_pretrained("hf_format_tokenizer")
```
I get the following error:
AttributeError: 'SentencePieceUnigramTokenizer' object has no attribute 'save_pretrained'
Hi @StephennFernandes,
This is because in the previous message I was talking about the AutoTokenizer class from transformers. I cannot reproduce your problem because you're loading from a local file, but I see that the SentencePieceUnigramTokenizer you're using has a couple of methods you could try: save and save_model. You can list them with dir(tokenizers.SentencePieceUnigramTokenizer).
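For example, something like this should give you a tokenizer.json that AutoTokenizer can load (untested sketch; the file names and the PreTrainedTokenizerFast wrapper are just one way to do it, and depending on your transformers version you may need to re-add your special tokens on the wrapped tokenizer):

```python
from tokenizers import SentencePieceUnigramTokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

tok = SentencePieceUnigramTokenizer.from_spm("tokenizer.model")
tok.save("tokenizer.json")  # serializes the whole tokenizer to a single JSON file

# Wrap the JSON file as a transformers fast tokenizer and save it in HF format.
hf_tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
hf_tok.save_pretrained("hf_format_tokenizer")

# AutoTokenizer can now load it from that directory.
tok2 = AutoTokenizer.from_pretrained("hf_format_tokenizer")
```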