I have trained a tokenizer locally and saved it to a JSON file, and I can use it to encode text like this:
```python
import sys

from tokenizers import normalizers, Tokenizer

# Load the tokenizer from the JSON file passed as the first argument
tokenizer = Tokenizer.from_file(sys.argv[1])
tokenizer.normalizer = normalizers.Sequence([normalizers.Strip()])

for line in sys.stdin:
    output = tokenizer.encode(line)
    print(" ".join(output.tokens))
```
This works just fine, but it doesn't seem to be leveraging the fast (Rust-based) implementation of the tokenizers. I say that because the instantiated tokenizer has no
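For reference, here is a minimal sketch of the kind of check I have in mind, wrapping the saved JSON file in the `transformers` fast-tokenizer class and inspecting `is_fast` (this assumes `transformers` is installed; the inline training is just a stand-in for my actual tokenizer so the snippet runs on its own):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train and save a throwaway WordPiece tokenizer so the example is
# self-contained; in my case, "tokenizer.json" already exists on disk.
tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.train_from_iterator(
    ["hello world", "hello there"],
    trainers.WordPieceTrainer(vocab_size=50, special_tokens=["[UNK]"]),
)
tok.save("tokenizer.json")

# Wrap the JSON file in the transformers fast-tokenizer class
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
print(fast_tokenizer.is_fast)  # True: backed by the Rust implementation
```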
For the sake of completeness, the tokenizer I'm using is WordPiece, though I've also tested with BPE and Unigram.
How can I use the fast implementation of the tokenizers when I’m loading them from a local JSON file?
P.S. I understand that the fast tokenizers are beneficial mainly when text is processed in batches, so the encoder can parallelize the work, whereas in my example the text is encoded one line at a time and cannot take advantage of that anyway. Please ignore that.
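That said, for completeness, this is the batch path I'm referring to, using `encode_batch` from the `tokenizers` API (the inline training is just a throwaway stand-in for loading my JSON file, so the sketch runs on its own):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Throwaway WordPiece tokenizer trained inline, standing in for
# Tokenizer.from_file("tokenizer.json") in my real script.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.train_from_iterator(
    ["hello world", "hello there"],
    trainers.WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]),
)

# encode_batch encodes all lines in one call, parallelized in Rust
lines = ["hello world", "hello there"]
encodings = tokenizer.encode_batch(lines)
for enc in encodings:
    print(" ".join(enc.tokens))
```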