I have a tokenizer trained using SentencePieceBPETokenizer from the tokenizers library.
tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)

# Customize training
tokenizer.train(
    files=paths,
    vocab_size=4000,
    min_frequency=2,
    show_progress=True,
    special_tokens=special_tokens,
)

# Saving model
tokenizer.save("sp/tokenizer.json")
It’s the first tokenizer I’ve trained that really fits the task (in particular because it actually fills the requested vocab_size), so I’d like to use it everywhere I can.

I’m wondering whether it’s possible to load the tokenizer.json into the SentencepieceTokenizer from tensorflow-text (i.e., text.SentencepieceTokenizer())? Alternatively, could it be loaded with PreTrainedTokenizerFast and then converted to the TF format? Either way, the vocab and merges are the crucial parts I want to reuse in TF.
Thanks in advance.