I have a tokenizer trained using SentencePieceBPETokenizer from the tokenizers library.
tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)

# Customize training
tokenizer.train(
    files=paths,
    vocab_size=4000,
    min_frequency=2,
    show_progress=True,
    special_tokens=special_tokens,
)

# Saving model
tokenizer.save("sp/tokenizer.json")
It’s the first tokenizer I’ve trained that really fits the task (in particular because it actually fills the requested vocab_size), so I’d like to use it everywhere I can.

I’m wondering whether it’s possible to load the tokenizer.json into the SentencepieceTokenizer from tensorflow-text (i.e., text.SentencepieceTokenizer())? Alternatively, could it be loaded with PreTrainedTokenizerFast and then converted to the TF format? Either way, the vocab and merges are the crucial parts I want to reuse in TF.
Thanks in advance.