Documentation of SentencePieceBPETokenizer?

I cannot find SentencePieceBPETokenizer in the official documentation, hence this question.

I referred to the source code of SentencePieceBPETokenizer to write my own code for training a custom tokenizer based on it.

Question 1: Does the following code make sense?

from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]
tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=10,
    show_progress=True,
    special_tokens=special_tokens,
)

where get_training_corpus() is a function that yields batches of sentences from a list of strings (a minimal sketch follows).
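The sentences list and the batch size here are just illustrative:

# Illustrative only: in my real code, `sentences` is a large list of strings.
sentences = ["first training sentence", "second training sentence"]

def get_training_corpus(batch_size=1000):
    # Yield the corpus in slices so the trainer consumes it batch by batch.
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]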

I notice that the library has Trainer classes (BpeTrainer, UnigramTrainer, WordLevelTrainer, WordPieceTrainer), but none named SentencePieceTrainer. Also, the Tokenizer documentation says the parameters of train_from_iterator() are an iterator and a trainer object, yet my keyword arguments follow the Trainer class instead, as sketched below.
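My reading of the SentencePieceBPETokenizer source (tokenizers 0.13.3) is that the wrapper is roughly equivalent to the following explicit-trainer setup, with a BpeTrainer built internally from the keyword arguments. This is a sketch of my understanding, not something I have verified against the docs:

from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, trainers
from tokenizers.models import BPE

# BPE model plus the SentencePiece-style pre-tokenization the wrapper sets up
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", add_prefix_space=True)
tokenizer.decoder = decoders.Metaspace(replacement="▁", add_prefix_space=True)

# Explicit trainer object, as described in the Tokenizer documentation
trainer = trainers.BpeTrainer(
    vocab_size=400000,
    min_frequency=10,
    show_progress=True,
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)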

Question 2: Since no error occurred, can I assume the keyword arguments are correct?
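One check I did try: since the implementation classes are plain Python, the accepted parameter names can be printed from the wrapper's signature (this only confirms the names exist, not that my values are sensible):

import inspect
from tokenizers import SentencePieceBPETokenizer

# Prints the keyword arguments the wrapper actually accepts
print(inspect.signature(SentencePieceBPETokenizer.train_from_iterator))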

Question 3: What are initial_alphabet and limit_alphabet used for? The documentation does not explain them clearly. Any example?
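To make Question 3 concrete, here is how I would guess they are meant to be used. The digits and the limit value are made up, and the comments state my interpretation, which is exactly what I am asking to have confirmed:

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=10,
    special_tokens=special_tokens,
    # My guess: characters forced into the starting alphabet even if they
    # never occur in the training data.
    initial_alphabet=["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
    # My guess: cap on the number of distinct characters kept overall.
    limit_alphabet=1000,
)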

P.S. I know show_progress=True does not work in Jupyter notebooks.
P.S. I am using tokenizers 0.13.3.
