I have trained a tokenizer locally and saved it to a JSON file, and I can use it to encode text like this:
```python
import sys

from tokenizers import normalizers, Tokenizer

# Load the tokenizer from the JSON file passed as the first argument
tokenizer = Tokenizer.from_file(sys.argv[1])
tokenizer.normalizer = normalizers.Sequence([normalizers.Strip()])

for line in sys.stdin:
    output = tokenizer.encode(line)
    print(" ".join(output.tokens))
```
This works just fine, but it doesn't seem to be leveraging the fast (Rust-based) implementation of the tokenizers. I say that because the instantiated tokenizer has no
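For reference, here is a minimal sketch of the kind of check I have in mind, wrapping the saved JSON file in the `transformers` fast-tokenizer class and inspecting `is_fast` (this assumes `transformers` is installed; the inline training is just a stand-in for my actual tokenizer so the snippet runs on its own):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train and save a throwaway WordPiece tokenizer so the example is
# self-contained; in my case, "tokenizer.json" already exists on disk.
tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.train_from_iterator(
    ["hello world", "hello there"],
    trainers.WordPieceTrainer(vocab_size=50, special_tokens=["[UNK]"]),
)
tok.save("tokenizer.json")

# Wrap the JSON file in the transformers fast-tokenizer class
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
print(fast_tokenizer.is_fast)  # True: backed by the Rust implementation
```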
For the sake of completeness, the tokenizer I'm using is WordPiece, though I've also tested with BPE and Unigram.
How can I use the fast implementation of the tokenizers when I’m loading them from a local JSON file?
P.S. I understand that the fast tokenizers are beneficial mainly when text is processed in batches, so the encoder can parallelize the work, whereas in my example the text is encoded one line at a time and cannot take advantage of that anyway. Please ignore that.
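That said, for completeness, this is the batch path I'm referring to, using `encode_batch` from the `tokenizers` API (the inline training is just a throwaway stand-in for loading my JSON file, so the sketch runs on its own):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Throwaway WordPiece tokenizer trained inline, standing in for
# Tokenizer.from_file("tokenizer.json") in my real script.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.train_from_iterator(
    ["hello world", "hello there"],
    trainers.WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]),
)

# encode_batch encodes all lines in one call, parallelized in Rust
lines = ["hello world", "hello there"]
encodings = tokenizer.encode_batch(lines)
for enc in encodings:
    print(" ".join(enc.tokens))
```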